Remote routing for automated operations management

ABSTRACT

Techniques are disclosed relating to automated operations management. In various embodiments, a computer system accesses operational information that defines commands for an operational scenario and accesses blueprints that describe operational entities in a target computer environment related to the operational scenario. The computer system implements the operational scenario for the target computer environment. The implementing may include executing a hierarchy of controller modules that includes an orchestrator controller module at a top level of the hierarchy that is executable to carry out the commands by issuing instructions to controller modules at a next level. The controller modules may be executable to manage the operational entities according to the blueprints to complete the operational scenario. In various embodiments, the computer system includes additional features such as an application programming interface (API), a remote routing engine, a workflow engine, a reasoning engine, a security engine, and a testing engine.

RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Appl. No. 62/840,892, filed Apr. 30, 2019, and U.S. Provisional Appl. No. 62/774,811, filed Dec. 3, 2018; the disclosures of each of the above-referenced applications are hereby incorporated by reference herein in their entireties.

BACKGROUND

Technical Field

This disclosure relates generally to operations management for computer systems.

Description of the Related Art

Historically, managing systems, such as ensuring that a service or a platform is running and available, has involved carrying out various run lists (sequences of commands). Such run lists were typically long, and manually entering their commands into a command line was time-consuming for a user. Eventually, software scripts were written that traversed through the run lists, entering the commands instead of the user. Those software scripts, however, often crashed, leaving the managed system in an unknown state. As a result, users still had to become involved by determining the current state of the system and then resetting the system to a state where the software scripts could be run again. Moreover, since the system generally involved multiple subsystems, multiple run lists had to be maintained and carried out, each having its own steps and ways of managing the system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating example elements of a system capable of managing operational entities, according to some embodiments.

FIG. 2 is a block diagram illustrating example elements of an operational entity, according to some embodiments.

FIG. 3A is a block diagram illustrating example elements of a definition and a blueprint for an operational entity, according to some embodiments.

FIG. 3B is a block diagram illustrating example elements of an entity descriptor for an operational entity, according to some embodiments.

FIG. 3C is a block diagram illustrating example elements of relationship information for an operational entity, according to some embodiments.

FIG. 3D is a block diagram illustrating example relationships between example operational entities, according to some embodiments.

FIG. 4 is a block diagram illustrating example elements of a controller module, according to some embodiments.

FIGS. 5 and 6 are flow diagrams illustrating example methods relating to managing the operational entities of a system, according to some embodiments.

FIG. 7 is a block diagram illustrating example elements of a control API, according to some embodiments.

FIG. 8 is a block diagram illustrating example elements of a routing engine, a routing layer, and routable entities, according to some embodiments.

FIGS. 9 and 10 are flow diagrams illustrating example methods relating to implementing an instruction associated with an operational entity, according to some embodiments.

FIGS. 11-13 are flow diagrams illustrating example methods relating to routing an instruction associated with an operational entity, according to some embodiments.

FIG. 14 is a block diagram illustrating example elements of a workflow engine, according to some embodiments.

FIGS. 15 and 16 are flow diagrams illustrating example methods relating to implementing a workflow, according to some embodiments.

FIG. 17 is a block diagram illustrating example elements of a reasoning engine, according to some embodiments.

FIGS. 18 and 19 are flow diagrams illustrating example methods relating to generating a workflow, according to some embodiments.

FIGS. 20A and 20B are block diagrams illustrating example elements of an authorization service, according to some embodiments.

FIG. 21 is a block diagram illustrating example elements of an authorization sheet, according to some embodiments.

FIG. 22 is a block diagram illustrating example elements of a token created by the authorization service, according to some embodiments.

FIG. 23 is a flow diagram illustrating an example method relating to the authorization service, according to some embodiments.

FIG. 24 is a block diagram illustrating example elements of a testing engine, according to some embodiments.

FIG. 25 is a flow diagram illustrating an example method relating to the testing engine, according to some embodiments.

FIG. 26 is a block diagram illustrating an example computer system, according to some embodiments.

This disclosure includes references to “one embodiment” or “an embodiment.” The appearances of the phrases “in one embodiment” or “in an embodiment” do not necessarily refer to the same embodiment. Particular features, structures, or characteristics may be combined in any suitable manner consistent with this disclosure.

Within this disclosure, different entities (which may variously be referred to as “units,” “circuits,” other components, etc.) may be described or claimed as “configured” to perform one or more tasks or operations. This formulation, [entity] configured to [perform one or more tasks], is used herein to refer to structure (i.e., something physical, such as an electronic circuit). More specifically, this formulation is used to indicate that this structure is arranged to perform the one or more tasks during operation. A structure can be said to be “configured to” perform some task even if the structure is not currently being operated. A “network interface configured to communicate over a network” is intended to cover, for example, an integrated circuit that has circuitry that performs this function during operation, even if the integrated circuit in question is not currently being used (e.g., a power supply is not connected to it). Thus, an entity described or recited as “configured to” perform some task refers to something physical, such as a device, circuit, memory storing program instructions executable to implement the task, etc. This phrase is not used herein to refer to something intangible. Thus, the “configured to” construct is not used herein to refer to a software entity such as an application programming interface (API).

The term “configured to” is not intended to mean “configurable to.” An unprogrammed FPGA, for example, would not be considered to be “configured to” perform some specific function, although it may be “configurable to” perform that function and may be “configured to” perform the function after programming.

Reciting in the appended claims that a structure is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that claim element. Accordingly, none of the claims in this application as filed are intended to be interpreted as having means-plus-function elements. Should Applicant wish to invoke Section 112(f) during prosecution, it will recite claim elements using the “means for” [performing a function] construct.

As used herein, the terms “first,” “second,” etc. are used as labels for nouns that they precede, and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) unless specifically stated. For example, in a processor having eight processing cores, the terms “first” and “second” processing cores can be used to refer to any two of the eight processing cores. In other words, the first and second processing cores are not limited to processing cores 0 and 1, for example.

As used herein, the term “based on” is used to describe one or more factors that affect a determination. This term does not foreclose the possibility that additional factors may affect a determination. That is, a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors. Consider the phrase “determine A based on B.” This phrase specifies that B is a factor that is used to determine A or that affects the determination of A. This phrase does not foreclose that the determination of A may also be based on some other factor, such as C. This phrase is also intended to cover an embodiment in which A is determined based solely on B. As used herein, the phrase “based on” is thus synonymous with the phrase “based at least in part on.”

DETAILED DESCRIPTION

Instead of using software scripts that enter commands from a run list, some developers have started to use large-scale deployment systems, such as Kubernetes®, for managing their systems. Kubernetes® provides a container-centric management environment for automating the deployment and management of containers, which are portable, self-sufficient units having an application and its dependencies. Accordingly, developers may use Kubernetes® to deploy containers that include database servers, for example. Kubernetes®, however, is deficient for managing an entire system for various reasons. Kubernetes® was designed to manage containers on top of worker nodes (e.g., virtual machines). For example, Kubernetes® cannot be used to manage hardware or information entities, such as a logical database or a configuration specification. Kubernetes® does not provide visibility and/or control over everything within a container. For example, Kubernetes® cannot interface with a database server instantiated in a container. As a result, management objectives, such as starting up systems within a container, troubleshooting those systems, and gathering metadata on those systems, are not viable using Kubernetes® alone. Other approaches, such as Spinnaker®, share the same deficiencies with Kubernetes® as they lack the ability to control the full range of components that can be found within a system.

While managing systems has traditionally been terribly difficult, testing those systems or troubleshooting issues that occur has been equally difficult. As previously expressed, when a run list fails, it is usually difficult to discern the current state of the system as the run list does not provide information about the state that it left the system in before crashing. Additionally, there is generally a great deal of complexity within a system that has to be tested. As a result, parts of a system may not be tested if overlooked due to the complexity of the system and those parts that are tested may not be tested thoroughly enough.

The present disclosure describes techniques for managing systems that overcome some or all of the deficiencies of prior approaches. In various embodiments described below, the operational entities (e.g., database servers, logical databases, etc.) within a system are described in a formal, structured manner and are then subsequently managed by controller modules. Generally speaking, many of the operational entities within a system are either hardware or information stored on hardware. If an operational entity is solely a physical, tangible component of a system, then it is referred to within this disclosure as a “hardware entity.” A physical server rack and a blade are examples of hardware entities. If an operational entity is not hardware, then it may consist of information, i.e., data that is stored on hardware. This information may either be executable or non-executable. Information that is executable by a system to perform operations is referred to within this disclosure as a “software entity.” A database server instance, an alert server instance, and a metric server instance are examples of software entities. On the other hand, information within a system that is not executable is referred to within this disclosure as an “information entity” (or alternately, an “information-oriented entity”). A database backup image and a tenant construct that includes data for a tenant of a multi-tenant database are examples of information entities.

Those three entity types (hardware, software, and information) can be considered the “building blocks” for any operational entity that may be found within a system. For purposes of the present disclosure, any operational entity within a system can either be described using one of the three building block entity types, or using a combination entity type (or alternatively, a “formation”) that includes two or more of these building blocks. One example of a formation entity may be an operational entity for a “database system” as the database system may include a processor and a storage medium (hardware entities), a database server (a software entity) that executes on that processor, and a logical database (an information entity) that is managed by that database server. Another example of a formation entity may be an operational entity for a “storage area network” as the storage area network may include storage mediums, network switches (which themselves might be formations of hardware and software entities), and data objects (information entities) that are stored on the storage mediums.

In various embodiments, the intended/expected state for a system (the state that an operator of the system wishes the system to be in) is initially defined. Defining the intended state for a system may involve creating definitions and blueprints that define the various operational entities that make up the system along with the relationships between those operational entities. These definitions and blueprints may follow a common schema that provides a structured way to describe operational entities. As will be explained further, definitions and blueprints may convey information to a controller module that enables that controller module to manage the operational entities corresponding to the definitions and blueprints. For example, if a controller module is managing a database service operational entity, then the controller module may learn from a blueprint linked to that database service that the database service should include three running database servers. If the controller module observes that the database service includes only two running database servers, then the controller module may start a third database server to reach the intended state of that database service entity.
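The database service example above can be sketched as a small reconciliation loop. This is a minimal illustration only, not the disclosed implementation: the `DatabaseService` class, the `intended_server_count` key, and the server naming scheme are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseService:
    """Hypothetical stand-in for a database service operational entity."""
    running_servers: list = field(default_factory=list)

    def start_server(self):
        # Hypothetical naming scheme: next index after the running servers.
        name = f"db-server-{len(self.running_servers) + 1}"
        self.running_servers.append(name)
        return name


def reconcile(service, blueprint):
    """Start servers until the observed count reaches the intended count."""
    started = []
    while len(service.running_servers) < blueprint["intended_server_count"]:
        started.append(service.start_server())
    return started


# Observed state: two running servers. Intended state: three.
service = DatabaseService(running_servers=["db-server-1", "db-server-2"])
blueprint = {"intended_server_count": 3}
started = reconcile(service, blueprint)
```

Calling `reconcile` a second time starts nothing, since the observed state already matches the intended state.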

That transition from two running database servers to three running database servers can be viewed as a state transition of the system between two states. Accordingly, the operational management of a system can be viewed as or compared to a state machine in which the system can exist in and transition through different states. As mentioned, the definitions and blueprints may define the intended state for entities within the system and thus the system as a whole. In various embodiments, the controller modules of the system transition the system from one state to another state until the system arrives at the intended state. The controller modules may then continue monitoring the system to ensure that the system remains in the intended state. If the system leaves the intended state (e.g., an entity crashes), the controller modules may implement one or more commands to move the system back to the intended state by issuing instructions to components within the system. In some cases, a command may be written in a manner that allows it to be read by a user (i.e., a human-readable command) and thus a controller module may translate that command into an instruction understandable by components in the system; in other cases, a command may itself be the understandable instruction and thus a controller module may not have to translate it but may instead issue the command as the instruction to a component. As used herein, the term “component” is intended to encompass operational entities and controller modules. Thus, the term “component” can refer to an operational entity or a controller module.

To facilitate transitions between states of the system, in various embodiments, a control application programming interface (API) is implemented that provides a way to understand the current state of the system and to make changes to the system if and when needed. The control API may provide structured access to the operational entities in a system through a set of API calls that provide a mechanism for learning about an operational entity's state and for invoking functionality supported by that operational entity. In various embodiments, controller modules host the control API and thus enable users and/or other controller modules to have access to the operational entities managed by those controller modules via the control API.

In various embodiments, the controller modules and the operational entities in a system may form a hierarchy where an orchestrator controller module may reside at the top level of the hierarchy and may issue instructions (which may include control API calls) down through the hierarchy to controller modules and operational entities that reside in lower levels. As an example, the orchestrator controller module may receive, from a user via a command line tool, a command pertaining to a particular entity. Accordingly, the orchestrator controller module may route an instruction (that is based on the command) through the levels of the hierarchy to the managing controller module that may implement the instruction by making an appropriate control API call to the particular entity. That call may cause the particular operational entity to transition to another state.
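One way to picture routing through the hierarchy is a depth-first search from the orchestrator down to the controller module that manages the target entity. The sketch below is an assumption-laden illustration: the `ControllerModule` class, its `route` method, and the module and entity names are hypothetical, and a real routing layer may work quite differently.

```python
class ControllerModule:
    """Hypothetical controller module that knows its managed entities and children."""

    def __init__(self, name, managed_entities=(), children=()):
        self.name = name
        self.managed_entities = set(managed_entities)
        self.children = list(children)

    def route(self, entity_id, path=None):
        """Return the chain of module names from this module to the managing module."""
        path = (path or []) + [self.name]
        if entity_id in self.managed_entities:
            # The managing module would make the control API call here.
            return path
        for child in self.children:
            found = child.route(entity_id, path)
            if found is not None:
                return found
        return None  # no module under this one manages the entity


cluster_a = ControllerModule("cluster-a-controller", managed_entities={"db-server-1"})
cluster_b = ControllerModule("cluster-b-controller", managed_entities={"db-server-2"})
orchestrator = ControllerModule("orchestrator", children=[cluster_a, cluster_b])

path = orchestrator.route("db-server-2")
```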

This paradigm permits operational scenarios to be implemented. Broadly speaking, the term “operational scenario” is used herein to refer to a sequence of steps used to perform some action. For example, one operational scenario might include starting up a database service and another operational scenario might include updating a database service. Operational scenarios may be implemented using a workflow that includes an ordered set of commands, which may be carried out via a set of control API calls. For example, a workflow for updating a database service might include a command for transitioning operational entities of the database service from “online” to “offline,” a command for transitioning them from their current version to the updated version, and a command for transitioning them from offline back to online. In some embodiments, the orchestrator controller module accesses workflow information and issues instructions based on the workflow information down through the hierarchy to make changes to the appropriate operational entities in order to complete the corresponding workflow.
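The update workflow described above can be sketched as an ordered list of transitions, each with a precondition. This is only an illustrative model: the tuple layout, the `run_workflow` helper, and the version strings `"1.0"` and `"2.0"` are assumptions made for the example.

```python
def run_workflow(entity, workflow):
    """Apply each command in order; fail if a command's precondition does not hold."""
    for attribute, expected, target in workflow:
        if entity[attribute] != expected:
            raise RuntimeError(
                f"{attribute} is {entity[attribute]!r}, expected {expected!r}"
            )
        entity[attribute] = target
    return entity


# Ordered commands: take offline, update the version, bring back online.
update_workflow = [
    ("status", "online", "offline"),
    ("version", "1.0", "2.0"),
    ("status", "offline", "online"),
]

service = {"status": "online", "version": "1.0"}
run_workflow(service, update_workflow)
```

Because each step checks the state it expects to start from, a failed run reports where the workflow stopped rather than leaving the state unknown.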

An operational scenario may alternatively be implemented by defining a set of intended states for the appropriate entities. An orchestrator controller module may then determine, from the set of intended states, commands for reaching those states. For example, an intended state (or goal) might be to have a running database service that includes three database servers. Thus, an orchestrator controller module may generate commands having control API calls for starting up three database servers. This process can be referred to as the orchestrator “reasoning” about the intended state. This reasoning thus allows a user, in some instances, to define a goal without having to articulate the specific steps or actions needed to achieve that goal, which may instead be determined by “reasoning” performed by the orchestrator. In some cases, this approach may be more robust than defining workflows since the system might end up in an intended state on its own (e.g., the disappearance of a container might fulfill the intended state of not having that container).
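A very reduced form of this reasoning is a set difference between the observed and intended entities. The sketch below is hypothetical: the `reason` function, the `("start", name)` and `("stop", name)` command tuples, and the entity names are invented for illustration and do not represent the reasoning engine itself.

```python
def reason(observed, intended):
    """Derive commands that move the observed set of entities to the intended set."""
    commands = [("start", name) for name in sorted(intended - observed)]
    commands += [("stop", name) for name in sorted(observed - intended)]
    return commands


# Goal: three running database servers, two of which already exist.
plan = reason({"db-1", "db-2"}, {"db-1", "db-2", "db-3"})

# If a container has already disappeared on its own, no command is needed.
empty_plan = reason(set(), set())
```

The second call shows the robustness point from the paragraph above: when the system has already reached the intended state by itself, the derived plan is empty.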

These techniques may be advantageous over prior approaches as these techniques allow for entire systems to be described and then managed in an automated manner. Prior approaches such as Kubernetes® allow for containers to be defined and instantiated, but do not provide a mechanism for controlling the full range of components in a system (e.g., hardware, software within a container, information constructs such as logical databases, etc.). Additionally, these techniques allow for the intended state of a system to be defined and thus allow for the system to be controlled in an automated fashion that reduces reliance on human intervention to manage the system. That is, controller modules within the system may continually monitor the system to ensure that the system is in the intended state. If the system changes to a different, undesired state, controller modules may transition the system back to the intended state without human intervention in many cases. This automation can reduce the number of humans involved in managing the system. Moreover, the use of a common format for describing entities may simplify operations, increase the ability to test operational scenarios, and reduce the amount of code needed to manage software in a production environment. These techniques may further be applied to mutable deployments, in which the deployment can be changed (e.g., by adding a set of nodes to a pool of application servers) without recreating that entire deployment, and to immutable deployments (e.g., where an entire deployment is recreated for each change to the deployment). These techniques also provide integrated fault testing (versus having to use a completely separate tool), integrated security, and integrated troubleshooting.

Turning now to FIG. 1, a block diagram of a system 100 is depicted. In the illustrated embodiment, system 100 includes operational entities 110, controller modules 120 (including an orchestrator controller module 120), a database 130, and an authorization service 140. Also as illustrated, database 130 includes operational information 135 and authorization service 140 includes a test engine 150. In some embodiments, system 100 may be implemented differently than shown. As an example, system 100 may include multiple orchestrator controller modules 120, another level of operational entities 110 and/or controller modules 120, multiple databases 130, etc.

An operational entity 110, in various embodiments, includes one or more elements and a collection of information relating to those elements. Examples of elements include a physical processor, physical memory, a virtual machine, a virtual machine image, a logical database, a database snapshot, a container, a container image, a database service, an operating system, a workflow, a database center, a network boundary/domain, etc. As discussed earlier, there are three basic types of operational entities 110: hardware entities 110, software entities 110, and information entities 110. An operational entity 110's type, in various embodiments, is dependent on what elements make up that operational entity. For example, an operational entity 110 that includes only a physical processor is considered a hardware entity 110. These three basic types may be used to make formation operational entities 110. A formation entity 110 is a collection of two or more entities, each with zero or more relationships with the other entities. An example of a formation entity 110 is a database system entity 110 that includes a processor and a storage medium (hardware entities 110), a database server (a software entity 110) executing on that processor, and a logical database (an information entity 110) managed by that database server.

An operational entity 110 may include or be associated with a collection of information that describes that operational entity. The information may include: a definition that may define what elements and variables can be used to make up a particular species of operational entity 110; and a blueprint that may define an instance of that species of operational entity 110. For example, a definition may define a database service entity 110 as including database server entities 110 while a blueprint may define a particular database service entity 110 as including 15 database server entities 110. In various embodiments, the information that is associated with an operational entity 110 further defines functions of a control API. Such information may be used by a controller module 120 to learn about what functions may be called for an operational entity 110 to manage that operational entity, such as by transitioning that operational entity to different states.
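The distinction between a definition (the species) and a blueprint (an instance of that species) can be illustrated with two small dictionaries and a conformance check. The keys `kind`, `allowed_elements`, and `elements`, and the `conforms` helper, are hypothetical names chosen for this sketch and are not the schema used by the disclosure.

```python
# Definition: what a database service entity may be made of.
definition = {
    "kind": "database_service",
    "allowed_elements": {"database_server"},
}

# Blueprint: one particular database service with 15 database servers.
blueprint = {
    "kind": "database_service",
    "elements": {"database_server": 15},
}


def conforms(blueprint, definition):
    """Check that a blueprint only uses elements its definition allows."""
    return (
        blueprint["kind"] == definition["kind"]
        and set(blueprint["elements"]) <= definition["allowed_elements"]
    )


is_valid = conforms(blueprint, definition)
```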

A controller module 120, in various embodiments, is a set of software routines that are executable to manage a set of operational entities 110 and/or controller modules 120. In some embodiments, controller modules 120 are defined in a generic manner such that each controller module 120 in system 100 supports the same functionality, although they may serve in different roles within system 100. Controller modules 120 may also work in a variety of environments, including bare metal, Amazon Web Services® (AWS), and Kubernetes®. For example, in a Kubernetes® environment, a controller module 120 may serve as a Kubernetes® operator that interacts with other controller modules 120 that are within database containers. On AWS, the AWS cloud may be defined as an operational entity 110 that dispenses virtual machine entities 110. Inside each virtual machine entity 110 may be a controller module 120 that manages the contents (operational entities 110) of that virtual machine entity.

In order to manage operational entities 110 and/or controller modules 120, a controller module 120 may have access to a control API for each operational entity 110 under its control or authority. In various cases, the control API calls of the control API may be identical for all operational entities 110, although the implementation of the functions for those control API calls may be different between operational entities 110. Through issuing control API calls, a controller module 120 may obtain information pertaining to an operational entity 110 (e.g., a blueprint) and may transition that operational entity between states. As an example, a controller module 120 may send a “transition” control API call to a database server entity 110 to transition the database server element from offline to online.
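The idea of identical control API calls with entity-specific implementations maps naturally onto an abstract base class. In the sketch below, only the name “transition” comes from the text above; the `describe` call, the class names, and the state names other than “offline” and “online” are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class ControlAPI(ABC):
    """Hypothetical uniform interface exposed by every operational entity."""

    @abstractmethod
    def describe(self) -> dict:
        """Return information about the entity, such as its current state."""

    @abstractmethod
    def transition(self, target_state: str) -> str:
        """Move the entity to target_state and return the resulting state."""


class DatabaseServerEntity(ControlAPI):
    def __init__(self):
        self.state = "offline"

    def describe(self):
        return {"kind": "database_server", "state": self.state}

    def transition(self, target_state):
        # A real implementation would start or stop the server process.
        self.state = target_state
        return self.state


class LogicalDatabaseEntity(ControlAPI):
    ALLOWED = {"detached", "attached"}  # hypothetical states

    def __init__(self):
        self.state = "detached"

    def describe(self):
        return {"kind": "logical_database", "state": self.state}

    def transition(self, target_state):
        # Same call as above, different implementation and validation.
        if target_state not in self.ALLOWED:
            raise ValueError(f"unsupported state: {target_state}")
        self.state = target_state
        return self.state


server = DatabaseServerEntity()
new_state = server.transition("online")
```

A controller module written against `ControlAPI` can manage both entity types without knowing how either one implements a transition.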

In various embodiments, a controller module 120 maintains information relating to the operational entities 110 and/or controller modules 120 managed by that controller module. In some cases, when initiated, a controller module 120 may access a properties file that provides initial information to that controller module 120; such information may identify a port number to listen on and any operational entities 110 that are under the control of the controller module 120. The information maintained by a controller module 120 may also include information that is gathered from operational entities 110, such as blueprints, definitions, and advertised control API calls that are supported by those operational entities.

As mentioned earlier, operational entities 110 and controller modules 120 may form a hierarchy. The controller module 120 at the top of that hierarchy is referred to as an orchestrator controller module. The orchestrator controller module 120 may be tasked with implementing a high-level goal and thus may orchestrate other controller modules 120 in other levels of the hierarchy to achieve that goal. As an example, an orchestrator controller module 120 might be tasked with updating an entire database fleet entity 110. As a result, the orchestrator controller module 120 may coordinate the update of each database cluster entity 110 within the database fleet entity 110. This may leverage other controller modules 120 that manage the operational entities 110 within those database cluster entities 110. Those controller modules 120, however, may hide the details of what happens in updating those database cluster entities 110 from the orchestrator controller module 120. In some embodiments, system 100 may include multiple hierarchies, each of which may include its own orchestrator controller module 120.

Database 130, in various embodiments, is a collection of information that is organized in a manner that allows for access, storage, and manipulation of the information. Database 130 may be implemented by a single storage device or multiple storage devices that are connected together on a network, such as a storage area network, and configured to redundantly store information in order to prevent data loss. Database 130 may include supporting software that permits controller modules 120 to perform operations (e.g., accessing, storing, manipulating, etc.) on information in database 130. In various embodiments, database 130 stores definitions, blueprints, operational information 135, and/or other information that pertain to the operational entities 110 within system 100. When managing an operational entity 110, a controller module 120 may access information from database 130 so that it may properly manage that operational entity.

Operational information 135, in various embodiments, is a collection of information defining operational scenarios for target environments 137. As noted, an operational scenario is a sequence of steps used to perform some action (or high-level goal). In some embodiments, operational scenarios are defined in workflow documents, where a given workflow document may specify a set of commands for implementing the sequence of steps for the corresponding operational scenario. A given operational scenario may be associated with a target environment 137 such that the operational scenario may change a state of that target environment.

Target environment 137, in various embodiments, is a group of operational entities 110 and/or controller modules 120 that may be operated on as part of an operational scenario. For example, in an operational scenario that updates a database fleet entity 110 to a newer version, the database fleet entity 110 (including all its operational entities 110 and controller modules 120) is considered the target environment 137 of that operational scenario. Another operational scenario might involve a different target environment 137 having different operational entities 110 and/or controller modules 120.

Authorization service 140, in various embodiments, is operable to protect system 100 from inappropriate use (whether it is malicious or non-malicious) by authenticating (“who is trying to change state?”), authorizing (“are they allowed to change state?”), and auditing (recording the outcomes for the authenticating and authorizing) commands being issued to controller modules 120 and/or operational entities 110. For example, service 140 may audit commands issued by a user to orchestrator controller module 120 in order to prevent performance of any unauthorized issued commands. Service 140 may also audit commands issued by orchestrator controller module 120 to other controller modules 120 (or by other controller modules 120 to operational entities 110) in order to ensure a user has not attempted to gain unauthorized access by circumventing orchestrator controller module 120. As will be discussed below with respect to FIGS. 20-23, in various embodiments, authorization service 140 maintains a set of security rules defining permissible actions for implementing various operational scenarios within a target computing environment and verifies that issued commands comply with the permissible actions defined by the set of security rules.
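The three roles named above (authenticating, authorizing, and auditing) can be sketched in a few lines. Everything in this example is hypothetical: the token table, the rule table, the user names, and the `check_command` function are placeholders, and the actual service described with respect to FIGS. 20-23 is not shown here.

```python
# Hypothetical credential table and security rules.
KNOWN_TOKENS = {"token-a": "alice", "token-b": "bob"}
SECURITY_RULES = {"alice": {"update", "start"}, "bob": {"describe"}}

audit_log = []


def check_command(token, action):
    """Authenticate the issuer, authorize the action, and audit the outcome."""
    user = KNOWN_TOKENS.get(token)  # authenticate: who is trying to change state?
    allowed = user is not None and action in SECURITY_RULES.get(user, set())
    audit_log.append({"user": user, "action": action, "allowed": allowed})
    return allowed


results = [
    check_command("token-a", "update"),  # known user, permitted action
    check_command("token-b", "update"),  # known user, action not permitted
    check_command("forged", "start"),    # unknown issuer
]
```

Every decision is recorded, including denials, so the log reflects both the authentication and the authorization outcome for each command.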

Test engine 150, in various embodiments, is a test component operable to inject fault conditions into a system 100 in order to identify states in which system 100 fails to function properly. In general, these faults may pertain to crashes, hangs, errors, lock-step ordering issues, time injection, etc. For example, test engine 150 may disable a database server being used by system 100 to see if such an action places system 100 in a state from which it is unable to recover. As will be described in greater detail below in conjunction with FIGS. 24 and 25, in various embodiments, test engine 150 interfaces with one or more controller modules 120 and/or operational entities 110 in order to determine the current state of system 100. For example, test engine 150 may collect information about the current state of system 100 before injecting a fault condition and then collect information about the current state after the injection in order to determine how the state of system 100 has been altered. In some embodiments, test engine 150 may also monitor the state of system 100 in order to inject particular fault conditions when particular commands are being issued by controller modules 120. That is, as requests flow through authorization service 140, test engine 150 may use this point in the architecture for coordinating fault injections (e.g., before a change, after a change, and during a change). For example, test engine 150 may determine, from a request being processed by authorization service 140, that an operational entity 110 is undergoing an update and then may attempt to inject a fault condition, which may result in the update failing, in order to determine whether system 100 is able to handle a fault condition during the update process.

In the illustrated embodiment, test engine 150 is shown as being integrated into authorization service 140 as such integration may allow test engine 150 to have greater insight into the current state of system 100 when various components ask service 140 for permission to perform various actions. In some embodiments, an external test system may interact with test engine 150 to orchestrate changes and inject faults into system 100.

Turning now to FIG. 2, a block diagram of an operational entity 110 is depicted. In the illustrated embodiment, operational entity 110 includes a blueprint 210, one or more elements 220, and a control API implementation 230. As shown, operational entity 110 interfaces with a controller module 120. In some embodiments, operational entity 110 may include a controller module 120 that interfaces with an external controller module 120. As an example, operational entity 110 may be a software container having a controller module 120 that communicates with an orchestrator controller module 120. In some embodiments, operational entity 110 may be implemented differently than shown (e.g., blueprint 210 may be stored at database 130 instead of operational entity 110, or it might be stored at both entity 110 and database 130).

As explained earlier, in order to enable the operational entities 110 of system 100 to be managed, in various embodiments, the entities 110 may be described using blueprints 210 and definitions (which are discussed in more detail with respect to FIG. 3A) that contain information about the operational entities themselves and their relationships with other entities 110. Such information may convey, to controller modules 120, how different operational entities 110 may be managed.

Blueprint 210, in various embodiments, is a collection of information defining aspects of a specific implementation of an operational entity 110. Blueprint 210 may define a desired or intended state in which an administrator of system 100 wishes an operational entity 110 to exist. For example, one particular blueprint 210 might describe a database fleet entity 110 as including 15 database servers while another particular blueprint 210 might describe a database fleet entity 110 as including 10 database servers. As discussed in greater detail with respect to FIGS. 3A-3D, in various embodiments, blueprint 210 includes an entity descriptor that may define values for a selected set of attributes that are usable to manage an operational entity 110, relationship information that may describe relationships between the operational entity 110 and other entities, and entity-specific variables that may be used for configuring the operational entity 110. A blueprint 210 for an operational entity 110 may be provided and/or altered by a user of system 100, a user who developed that entity (e.g., the user who wrote the software), a controller module 120, etc., or any combination thereof. For example, a managing controller module 120 might alter version information in the entity descriptor when updating the corresponding operational entity 110 to a new version.

In various embodiments, blueprint 210 may be deployable to spawn an instance of the operational entity 110 that is defined by that blueprint. For example, an operator may provide a blueprint 210 for a database service entity 110; that blueprint may define the intended state of that database service entity as having 15 database servers. The controller module 120 that is responsible for deploying that blueprint may observe the state of system 100 to determine whether the database service entity 110 exists. If the database service entity 110 does not exist, then the managing controller module 120 may instantiate the database service entity 110 according to its blueprint 210 (e.g., by communicating with certain operational entities 110 that are capable of spawning the 15 database servers).
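The comparison between an intended state and an observed state can be illustrated with a short Python sketch. The `Blueprint` class, its fields, and the `plan_deployment` helper are hypothetical names introduced here and greatly simplify what a blueprint 210 may contain.

```python
from dataclasses import dataclass


@dataclass
class Blueprint:
    """Hypothetical, simplified stand-in for a blueprint 210."""
    entity_type: str
    intended_count: int  # e.g., 15 database servers


def plan_deployment(blueprint: Blueprint, observed_count: int) -> int:
    """Return how many instances still need to be spawned to reach the intended state."""
    return max(blueprint.intended_count - observed_count, 0)


database_service = Blueprint(entity_type="database server", intended_count=15)

# Nothing has been deployed yet, so all 15 servers remain to be spawned.
missing = plan_deployment(database_service, observed_count=0)
```

A managing controller module might repeat such a comparison whenever it observes the system, so that a partially completed deployment can be resumed rather than restarted.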

In some instances, blueprints 210 may form a hierarchy where implementing a top level blueprint 210 may involve implementing lower level blueprints 210. Accordingly, a blueprint 210 might include references to other blueprints 210. Returning to the previous example, the blueprint 210 for the database service entity 110 may include a reference to a blueprint 210 for a particular implementation of a database server entity 110. As such, when instantiating the database service entity 110, the managing controller module 120 may look up the blueprint 210 for the database server entity 110 via the blueprint 210 of the database service entity 110 so that it can cause instantiation of the 15 database servers.

Elements 220, in various embodiments, include hardware (e.g., physical processors and memory), software (e.g., database servers), information constructs (e.g., logical databases), or any combination thereof (e.g., an element 220 might be an operational entity 110 that includes its own set of elements 220 that are hardware, software, and information constructs). Examples of elements 220 include, but are not limited to, a physical processor and memory, a top-of-rack network switch, an operating system, a virtual machine, a virtual machine image, a database server, a logical database, a database snapshot, a container, a container image, a workflow, a database center, a tenant snapshot, and a tenant. In various cases, a controller module 120 may interface with elements 220 via control API implementation 230.

Control API implementation 230, in various embodiments, is a set of software routines executable to perform one or more functions of a control API (discussed in greater detail with respect to FIG. 7). Control API implementation 230 may serve as an interface between elements 220/blueprint 210 and a controller module 120. Consider an example in which the control API includes a “create” function and there exists a database server entity 110. That database server entity 110 may include a control API implementation 230 that defines, for that create function, a set of operations that creates a logical database entity 110. Accordingly, a controller module 120 that manages the database server entity 110 may issue a create function API call to invoke the logic of control API implementation 230 to create a logical database entity 110. Such logic may instruct a database server (an element 220) to create that logical database entity. In various embodiments, control API implementation 230 may be different between operational entities 110, where each operational entity 110 may uniquely implement one or more of the functions that are supported by the control API.

In various embodiments, control API implementation 230 is implemented as a wrapper that encapsulates and hides underlying complexity of an element 220. For example, a database server might include a service or command line tool responsible for starting and stopping the database server, and control API implementation 230 may sit on top of the service such that if a controller module 120 called a transition function of the control API to transition the database server to online, then control API implementation 230 may handle the communication with the database server's service to start the database server. The complexity of starting the database server may be hidden from a controller module 120; the controller module 120 may only have to make the appropriate control API call.

In some embodiments, an operational entity 110 may advertise, to a controller module 120, the functions of the control API that are implemented by control API implementation 230 and thus are invokable. In some cases, an operational entity 110 may advertise this information upon being instantiated; in other cases, this information might be advertised upon request by a controller module 120. For example, a controller module 120 may issue a “describe” function API call to an operational entity 110 to receive information about control API implementation 230. In some embodiments, a controller module 120 may be instantiated to include information about control API implementation 230 and may not have to communicate with an operational entity 110 to receive such information.

With knowledge about an operational entity 110's control API implementation 230, a controller module 120 may be able to process instructions. As an example, a controller module 120 may receive an instruction to create a logical database entity 110. That controller module 120 might be managing a database server entity 110 that advertises that it can create a logical database entity 110. As such, the controller module 120 may then issue a create function API call to that database server entity 110 to create a logical database entity 110. If, however, the information that is maintained by a controller module 120 indicates that an instruction cannot be processed as the managed operational entities 110 do not support the appropriate functions, then the controller module 120 may reject the instruction. In some cases, the controller module 120 may notify the issuing controller module 120 that the instruction has been completed or cannot be completed.
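As a rough illustration of the advertise-then-dispatch behavior described above, consider the following Python sketch. The class names, the `handle` method, and the returned status strings are hypothetical; only the “describe” and “create” function names come from the text.

```python
class DatabaseServerEntity:
    """Hypothetical entity whose control API implementation supports two functions."""

    def __init__(self):
        self.logical_databases = []

    def describe(self):
        # Advertise the control API functions that this entity implements.
        return {"describe", "create"}

    def create(self, name):
        # Stand-in for instructing the database server to create a logical database.
        self.logical_databases.append(name)
        return name


class ControllerModule:
    """Hypothetical controller that caches what its managed entities advertise."""

    def __init__(self, entities):
        self.entities = list(entities)
        self.advertised = {id(e): e.describe() for e in self.entities}

    def handle(self, function, *args):
        # Issue the call to the first managed entity that advertises the function.
        for entity in self.entities:
            if function in self.advertised[id(entity)]:
                return ("completed", getattr(entity, function)(*args))
        # No managed entity supports the function, so the instruction is rejected.
        return ("rejected", None)


server = DatabaseServerEntity()
controller = ControllerModule([server])
created = controller.handle("create", "orders")
unsupported = controller.handle("clone")
```

The returned tuple plays the role of the notification sent back to the issuing controller module.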

Turning now to FIG. 3A, a block diagram of a blueprint 210 and a definition 310 within database 130 is shown. In the illustrated embodiment, blueprint 210 and definition 310 both include an entity descriptor 320, relationship information 330, and variables 340 that include an expected state variable 345. In some embodiments, blueprint 210 and/or definition 310 may be implemented differently than shown. For example, blueprint 210 might correspond to more than the one definition 310 that is shown.

Definition 310, in various embodiments, is a collection of information that describes aspects of an operational entity 110. Similar to blueprint 210, definition 310 includes an entity descriptor 320, relationship information 330, and variables 340 as illustrated. In contrast to blueprint 210, definition 310 may not define a particular instance of an operational entity 110, but instead may describe what values may be included in a corresponding blueprint 210. That is, definition 310 may describe what blueprint 210 should look like. As an example, a definition 310 for a database fleet entity 110 might describe database fleets as including database server entities 110 while a corresponding blueprint 210 might define a particular database fleet entity 110 as including 15 database server entities 110. In various cases, definition 310 may be used to validate that a corresponding blueprint 210 is permitted. Continuing the previous example, if a blueprint 210 defines a certain database fleet entity 110 as including an application server entity 110 in addition to 15 database server entities 110, then that blueprint 210 may be rejected as the definition 310 of a database fleet entity 110 does not describe a database fleet entity 110 as including application server entities 110. In some embodiments, definition 310 may include a set of attributes with predefined values and a set of attributes whose values will be written in the corresponding blueprint 210 by a controller module 120 when that blueprint is deployed.
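The validation of a blueprint against its definition can be sketched in Python as follows. Representing a definition as a set of permitted member types and a blueprint as a mapping from member type to count is an assumption made for this example, as is the `validate_blueprint` name.

```python
from __future__ import annotations


def validate_blueprint(members: dict[str, int], allowed_types: set[str]) -> list[str]:
    """Return the member types that the definition does not permit (empty if valid)."""
    return sorted(t for t in members if t not in allowed_types)


# Hypothetical definition: a database fleet may include only database servers.
fleet_definition = {"database server"}

valid_fleet = {"database server": 15}
invalid_fleet = {"database server": 15, "application server": 1}

valid_result = validate_blueprint(valid_fleet, fleet_definition)
invalid_result = validate_blueprint(invalid_fleet, fleet_definition)
```

A non-empty result would correspond to the blueprint being rejected, as in the application server example above.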

In some embodiments, blueprint 210 may correspond to multiple definitions 310. For example, a blueprint 210 for a particular platform service entity 110 may describe that platform service entity as having a database server entity 110 and an application server entity 110. As such, the blueprint 210 may be associated with a definition 310 for a database server entity 110 and a definition 310 for an application server entity 110. In various cases, a blueprint 210 may not be valid if it does not satisfy all the relationships specified by the definitions 310 associated with that blueprint. For example, the definition 310 for the application server entity 110 may describe the application server entity 110 as depending on a metric server entity 110. As such, the blueprint 210 of the previous example may not be valid unless it describes a metric server entity 110. Accordingly, blueprint 210 may describe how a set of operational entities 110 are put together to satisfy the relationships defined in the corresponding definitions 310.

Entity descriptor 320, in various embodiments, is a collection of information describing various attributes of a corresponding operational entity 110. These attributes may be the same across all operational entities 110, but the values given may differ between operational entities 110. For example, entity descriptor 320 may include a kind attribute that indicates whether an operational entity 110 is hardware, software, or information. Accordingly, an entity descriptor 320 for a processor entity 110 may indicate hardware while an entity descriptor 320 for a metric server entity 110 may specify software. In various embodiments, entity descriptor 320 conveys information to a controller module 120 about how a corresponding operational entity 110 may be managed. Continuing with the previous example, a controller module 120 may know that it cannot clone that processor entity 110 because its entity descriptor 320 specifies hardware for the kind attribute. The various attributes of entity descriptor 320 are discussed in greater detail with respect to FIG. 3B.

Relationship information 330, in various embodiments, is a collection of information that specifies the relationships between a particular operational entity 110 and other operational entities 110. The relationships between operational entities 110 may be defined using various attributes that may be common across all relationships, but whose values may differ between relationships. For example, relationship information 330 might include a “type” attribute for each relationship. The relationship information 330 for an application server entity 110 might specify that there is a “depend” type relationship between the application server entity 110 and a database server entity 110. Similar to entity descriptor 320, relationship information 330 may convey information to a controller module 120 about how a corresponding operational entity 110 may be managed. In various cases, the relationships between operational entities 110 may affect the order in which an operational scenario can be implemented (that is, the order in which the commands that correspond to that operational scenario can be carried out). Continuing with the previous example, a controller module 120 might learn that the database server entity 110 ought to be instantiated before the application server entity 110 because the application server entity 110 depends on that database server entity 110. The various attributes of relationship information 330 are discussed in greater detail with respect to FIG. 3C.

Variables 340, in various embodiments, is a collection of additional information that is useful for managing a corresponding operational entity 110. As shown, variables 340 include an expected state variable 345. Expected state variable 345, in various embodiments, specifies the expected state of the corresponding operational entity 110. For example, the expected state variable 345 for a database server entity 110 might specify a value of “online.” Variables 340 may be used to specify a current state, one or more service endpoints such as Internet Protocol (IP) ports, IP addresses, configuration variables, etc. For example, variables 340 may specify which persistent data stores a particular database server entity 110 should use. In various embodiments, variables 340 may be hierarchical in nature. Variables 340 may further include attributes such as whether they will be defined on deployment or at another point in time. For example, an IP address variable 340 may be associated with an attribute indicating that the IP address variable 340 will be filled out during the deployment of the corresponding operational entity 110.

Turning now to FIG. 3B, a block diagram of an entity descriptor 320 is depicted. In the illustrated embodiment, entity descriptor 320 includes a universally unique type (UUT) 321, a lifecycle 322, a version 323, a kind 324, a universally unique identifier (UUID) 325, a contextual identifier (CID) 326, a vendor 327, a name 328, and a creation date 329. Entity descriptor 320 may include more or less information than illustrated. For example, entity descriptor 320 may not include name 328.

Universally unique type 321, in various embodiments, specifies a data value indicative of the type or species of an operational entity 110. Examples of UUTs 321 include, but are not limited to, “database server,” “application server,” “logical database,” “physical host system,” “database backup,” “tenant,” “workflow,” “log extension,” and “data extension.” UUT 321, in some embodiments, may be used as a key for looking up a corresponding definition 310 and/or blueprint 210. For example, relationship information 330 might specify the operational entities 110 of a relationship using their UUTs 321. This may allow for a managing controller module 120 to access corresponding definitions 310 and blueprints 210 to obtain information that may be pertinent to managing those entities 110. As discussed in more detail with respect to FIG. 8, UUT 321 (with lifecycle 322 and version 323, in various cases) may further be used to route an instruction to a particular operational entity 110. Also, UUT 321 may be displayed to a user so that the user may understand what operational entities 110 are present within system 100.

Lifecycle 322, in various embodiments, specifies a data value indicative of the stage at which an operational entity 110 is within its lifecycle. Examples of lifecycle stages include, but are not limited to, specification, snapshot, and instance. For example, a database backup image may be the snapshot stage for a database. Lifecycle 322 may affect the types of operations that can be performed with respect to an operational entity 110. For example, when a database server entity 110 is in its instance stage, a controller module 120 may be able to instruct that database server entity to create a database backup image; however, if that database server entity 110 is in its specification stage, the controller module 120 may not instruct that database server entity to create the database backup image. In various embodiments, lifecycle 322 may be used with UUT 321 as a key for looking up a corresponding definition 310 and/or blueprint 210. In some embodiments, lifecycle 322 provides a path between different lifecycle stages and can be used to automate the pipeline of an operational entity 110, e.g., from source code to live production software through control API calls that transition that operational entity through states.

Version 323, in various embodiments, specifies a data value indicative of the version of an operational entity 110. For example, the version 323 of a particular database server entity 110 may specify version “3.2.4”. Similar to lifecycle 322, version 323 may affect the types of operations that can be performed with respect to an operational entity 110. For example, a newer version of an operational entity 110 might include additional implementations for one or more of the functions of the control API. In some embodiments, version 323 may be used with both UUT 321 and lifecycle 322 as a key for looking up a particular definition 310 and/or blueprint 210.

Kind 324, in various embodiments, specifies a data value that is indicative of the form or manifestation (i.e., hardware, software, information, or a formation) of an operational entity 110. As with other attributes of entity descriptor 320, kind 324 may affect how an operational entity 110 can be managed by a controller module 120. As an example, if an operational entity 110 takes the form of software, then it may be cloneable; however, another operational entity 110 that takes the form of hardware may not be cloneable. In various embodiments, kind 324 affects what values can be used for the other attributes in entity descriptor 320. As an example, an operational entity 110 that takes the form of software may have a snapshot lifecycle stage, but an operational entity 110 that is hardware may not.

Universally unique identifier (UUID) 325, in various embodiments, specifies a data value that uniquely identifies an operational entity 110 independent of any other information specified by blueprint 210 or definition 310. As an example, a particular operational entity 110 may have a UUID 325 of “C7366F4-4BED-8BF0-BF281”. UUID 325 may enable a particular operational entity 110 to be directly referenced by a controller module 120 or a user. This may remove ambiguity in situations where a controller module 120 manages multiple operational entities 110 of the same type (e.g., two database server entities 110). As discussed in greater detail later, a given command may specifically identify an operational entity 110 using its UUID 325. As such, controller modules 120 may route a given command to the appropriate managing controller module 120 based on a UUID 325 that is identified by that command.

Contextual identifier (CID) 326, in various embodiments, specifies a data value that is indicative of a context associated with an operational entity 110. For example, CID 326 might specify an organization ID for the organization/tenant that is associated with the corresponding operational entity 110. In some embodiments, CID 326 may be used to associate metrics of an operational entity 110 with a particular tenant of system 100.

Vendor 327, in various embodiments, specifies a data value that identifies the vendor associated with an operational entity 110. Name 328, in various embodiments, specifies a data value that identifies a name for an operational entity 110, such as a product name, workflow name, tenant name, etc. Creation date 329, in various embodiments, specifies a data value that identifies the time when an operational entity 110 was created (e.g., in nanoseconds since the epoch UTC).
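The attributes of entity descriptor 320 can be gathered into a single record, as in the following Python sketch. The field names, the `lookup_key` and `is_cloneable` helpers, and the example identifier value are hypothetical; the cloneability rule is a deliberate simplification of the kind 324 discussion above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EntityDescriptor:
    """Hypothetical record mirroring the attributes of entity descriptor 320."""
    uut: str                                 # universally unique type 321
    lifecycle: str                           # lifecycle 322
    version: str                             # version 323
    kind: str                                # kind 324
    uuid: str                                # universally unique identifier 325
    cid: Optional[str] = None                # contextual identifier 326
    vendor: Optional[str] = None             # vendor 327
    name: Optional[str] = None               # name 328
    creation_date_ns: Optional[int] = None   # creation date 329

    def lookup_key(self) -> Tuple[str, str, str]:
        # UUT, lifecycle, and version together may key a definition or blueprint.
        return (self.uut, self.lifecycle, self.version)

    def is_cloneable(self) -> bool:
        # Simplifying assumption: only software entities may be cloned.
        return self.kind == "software"


db_server = EntityDescriptor(
    uut="database server", lifecycle="instance", version="3.2.4",
    kind="software", uuid="example-id-1",
)
processor = EntityDescriptor(
    uut="physical host system", lifecycle="instance", version="1.0",
    kind="hardware", uuid="example-id-2",
)
```

Here the optional fields reflect the statement that an entity descriptor may include more or less information than illustrated.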

Turning now to FIG. 3C, a block diagram of relationship information 330 is shown. In the illustrated embodiment, relationship information 330 includes relationships 331. As further illustrated, a relationship 331 includes a UUT 321, a lifecycle 322, a version 323, a relationship type 332, a direction 333, a cardinality 334, and properties 335. A relationship 331 may include more or less information than shown. For example, relationship 331 may not include version 323.

In many cases, the operational entities 110 within a system 100 may be related in some manner. As an example, an operational entity 110 that collects metric information from another operational entity 110 depends on the existence of that other entity. In various embodiments, the manner in which an operational entity 110 is managed by a controller module 120 depends on the relationships 331 that exist between that operational entity and other operational entities 110. As depicted, an entity's relationships 331 are defined in relationship information 330 and include multiple variables.

In order to identify the operational entities 110 that a particular operational entity 110 is related to, in various embodiments, a relationship 331 specifies a UUT 321, a lifecycle 322, and a version 323. For example, a controller module 120 may control a database server entity 110. Accordingly, a relationship 331 corresponding to the relationship between the controller module 120 and the database server entity 110 might specify a UUT 321 of “database server,” a lifecycle 322 of “instance,” and a version 323 of “3.21.” In some embodiments, a relationship 331 may indicate UUIDs 325 that specifically identify the operational entities 110 associated with that relationship.

Relationship type 332, in various embodiments, specifies a data value indicative of the type of relationship between a certain operational entity 110 and one or more other operational entities 110. The types of relationships include, but are not limited to, a “host” relationship, a “control” relationship, a “depend” relationship, a “consist of” relationship, a “contained in” relationship, a “fraction” relationship, and a “provision” relationship. A host relationship, in various embodiments, is a relationship in which a particular operational entity 110 hosts one or more other operational entities 110. As an example, a database server entity 110 may host a logical database entity 110. A control relationship, in various embodiments, is a relationship in which a particular operational entity 110 controls one or more other operational entities 110. As an example, a controller module 120 may control a metric server entity 110 and a database server entity 110. A depend relationship, in various embodiments, is one in which a particular operational entity 110 depends on one or more other operational entities 110. As an example, a metric server entity 110 may depend on a database server entity 110 existing so that it might gather metrics. A “consist of” relationship, in various embodiments, is one in which a particular operational entity 110 consists of one or more other operational entities 110. As an example, a database service entity 110 may consist of two database server entities 110. A “contained in” relationship, in various embodiments, is one in which a particular operational entity 110 is contained in one or more other operational entities 110. As an example, a database server entity 110 may be contained in a container entity 110. A provision relationship, in various embodiments, is one that identifies one or more operational entities 110 that may be provisioned by a particular operational entity 110. As an example, a container environment entity 110 may provision (or instantiate) container entities 110. In some embodiments, there may be an “I am” relationship where a particular operational entity 110 describes itself. For example, a database server entity 110 might have an “I am” relationship value of “database server.”

Direction 333, in various embodiments, specifies a data value indicative of the direction of a relationship between a particular operational entity 110 and one or more other operational entities 110. Direction 333 may indicate if a particular operational entity 110 is subservient to another operational entity 110. Consider an example in which there is a relationship between a database server entity 110 and a logical database entity 110. The relationship 331 defined from the perspective of the database server entity 110 might specify a relationship type 332 of “host” and a direction 333 of “false.” But the relationship 331 defined from the perspective of the logical database entity 110 may specify a relationship type 332 of “host” and a direction 333 of “true.” The resulting interpretation of the two relationships 331 may be that the database server entity 110 hosts the logical database entity 110 and the logical database entity 110 is hosted by the database server entity 110. As another example, direction 333 may indicate that a controller module 120 controls a database server entity 110 (in that controller's relationship information 330) and that the database server entity 110 is controlled by that controller module (in that database server's relationship information 330).

Cardinality 334, in various embodiments, specifies a data value that is indicative of the number of operational entities 110 that are associated with a corresponding relationship type 332 (which may exclude the particular operational entity 110 for which the corresponding relationship 331 is defined). For example, a database service entity 110 may consist of three database server entities 110. As a result, cardinality 334 may specify a value of “3” for the relationship 331 between the database service entity 110 and the three database server entities 110 from the perspective of that database service entity.

Properties 335, in various embodiments, specify additional data values that are useful for managing a corresponding operational entity 110. Properties 335 may specify the protocol used by the related operational entities 110 to communicate, the status of the relationship that exists between those operational entities, where those operational entities are located within system 100, etc. As an example, properties 335 may indicate that the operational entities 110 of a particular relationship are up and running. As the states of different relationships change within system 100, controller modules 120 may update relationship information 330 (e.g., update properties 335).

In a similar manner to the entity descriptor 320, in some embodiments, relationships 331 may convey information to a controller module 120 to help it understand how to manage operational entities 110. Consider an example in which a metric server entity 110 depends on a database server entity 110. A controller module 120 may determine, when wishing to start up the metric server entity 110, that the database server entity 110 needs to be started first as a result of the metric server entity 110 depending on the database server entity 110. Accordingly, a controller module 120 may use relationship information 330 along with the information from definitions 310 and blueprints 210 to reason about how to transition operational entities 110 between states (e.g., from offline to online). Relationship information 330 may, in various cases, be used to calculate the resource utilization of a system. For example, a container entity 110 may be contained by a host system entity 110; the container entity 110 thus uses a portion of that host system's resources. Similarly, the software that is contained within a container entity 110 uses a portion of that container's resources. This information may be useful for provisioning and automated capacity planning.
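One way to picture how “depend” relationships could determine a start-up order is a topological sort, sketched below in Python using the standard library's graphlib module (Python 3.9 or later). The tuple layout of the relationship records and the `startup_order` helper are assumptions for this example; the disclosure does not prescribe a particular ordering algorithm.

```python
from graphlib import TopologicalSorter

# Hypothetical relationship records: (entity, relationship type, related entity).
relationships = [
    ("metric server", "depend", "database server"),
    ("application server", "depend", "database server"),
]


def startup_order(relationships):
    """Order entities so that each one starts after the entities it depends on."""
    graph = {}
    for source, rel_type, target in relationships:
        graph.setdefault(source, set())
        graph.setdefault(target, set())
        if rel_type == "depend":
            # The target must be started before the source.
            graph[source].add(target)
    return list(TopologicalSorter(graph).static_order())


order = startup_order(relationships)
```

In this sketch the database server always precedes the two entities that depend on it, while the relative order of the metric server and the application server is left unspecified. A circular dependency would cause graphlib to raise a CycleError, which a controller would need to handle separately.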

Turning now to FIG. 3D, a block diagram of relationships between example operational entities 110 is shown. In the illustrated embodiment, operational entity 110A is an application server entity, operational entity 110B is a database server entity, and operational entity 110C is a metric server entity. As shown, operational entity 110B depends on operational entity 110A and there is a codependency between operational entities 110A and 110C. As discussed, the relationships between operational entities 110 may affect how controller modules 120 manage those operational entities. For example, when instantiating operational entities 110A-C, controller modules 120 may instantiate operational entity 110A before operational entity 110B as operational entity 110B depends on the existence of operational entity 110A.

Turning now to FIG. 4, a block diagram of a controller module 120 is shown. In the illustrated embodiment, controller module 120 includes operational entity information 410, control API information 420, an operational entity manager engine 430, a workflow engine 440, and a reasoning engine 450. In some embodiments, controller module 120 may be implemented differently than shown. For example, controller module 120 may not include reasoning engine 450.

As previously mentioned, controller module 120 may manage operational entities 110 and controller modules 120. To manage them, in various embodiments, controller module 120 maintains operational entity information 410 and control API information 420. In some cases, controller module 120 may maintain information 410 and 420 in a local storage; in other cases, it may maintain information 410 and 420 at database 130, which may enable controller module 120 to continue where it left off after a crash, as the local storage may not be persistent. That is, if controller module 120 crashes (or its container crashes), the information stored in its local storage may disappear along with it. Accordingly, any information that may be pertinent to the management of controller module 120's operational entities 110 may be maintained at database 130, which may be a non-volatile persistent storage. In yet other cases, controller module 120 may maintain information 410 and 420 in both its local storage and database 130.

When instantiated, in various embodiments, controller module 120 may be provided a properties file that provides initial information. This initial information may identify locations of the controller module 120's local storage and/or database 130 that include operational entity information 410 and control API information 420 that is relevant to that controller module. In some cases, the properties file may indicate the operational entities 110 that controller module 120 is responsible for managing and may indicate ports to listen on with respect to those entities 110 and other controller modules 120. In various embodiments, controller module 120 accesses information 410 and 420 using its properties file.

Operational entity information 410, in various embodiments, is information describing the operational entities 110 that are managed by controller module 120. Information 410 may include blueprints 210 and definitions 310 for the managed operational entities 110. In various embodiments, operational entity manager engine 430 uses operational entity information 410 to determine the intended states of its operational entities 110. With such knowledge, manager engine 430 may transition its operational entities 110 (e.g., by issuing control API calls) toward their intended states. In some instances, controller module 120 may be instantiated such that it includes operational entity information 410; in other instances, controller module 120 may issue control API calls to its operational entities 110 in order to retrieve information 410 (e.g., blueprints 210) from them.

Control API information 420, in various embodiments, is information that indicates the functions of the control API that are implemented by the operational entities 110 managed by controller module 120. As discussed earlier, in various embodiments, an operational entity 110 includes a control API implementation 230 implementing one or more functions of the control API. Through the control API implementation 230, controller module 120 may interface with the elements 220 of that operational entity 110. Accordingly, control API information 420 may indicate the one or more functions implemented by a control API implementation 230. In some cases, controller module 120 may be instantiated such that it includes control API information 420; in other cases, controller module 120 may issue a “describe” function call (of the control API) to the operational entities 110 that it manages in order to receive control API information 420 from them. Note that, in various embodiments, each operational entity 110 may implement the “describe” function call.

In various embodiments, control API information 420 includes a function map that maps certain information about an operational entity 110 to the functions that the operational entity 110 implements. In various cases, the information that is mapped to the functions may include an operational entity's UUT 321 and lifecycle 322. Note that, in some instances, an operational entity 110 might include different implementations of the same API function call for different lifecycle stages. As discussed in more detail with respect to FIG. 8, controller module 120 may use an operational entity's UUT 321, lifecycle 322, and/or UUID 325 to route commands.
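As a minimal sketch of such a function map, the following Python example keys implementations by a UUT value, a lifecycle stage, and a function name. The class name `FunctionMap`, the key layout, and the example UUT and lifecycle strings are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical key: (UUT 321 value, lifecycle 322 stage, control API function name).
FunctionKey = Tuple[str, str, str]


class FunctionMap:
    """Illustrative function map; the disclosed structure may differ."""

    def __init__(self) -> None:
        self._entries: Dict[FunctionKey, Callable[[], str]] = {}

    def register(self, uut: str, lifecycle: str, function: str,
                 impl: Callable[[], str]) -> None:
        self._entries[(uut, lifecycle, function)] = impl

    def lookup(self, uut: str, lifecycle: str,
               function: str) -> Optional[Callable[[], str]]:
        # Returns None when the entity does not implement the function
        # for the given lifecycle stage.
        return self._entries.get((uut, lifecycle, function))


function_map = FunctionMap()
# The same API function may have a different implementation per lifecycle stage.
function_map.register("db_persistence", "instance", "create",
                      lambda: "snapshot created from instance")
function_map.register("db_persistence", "snapshot", "create",
                      lambda: "instance restored from snapshot")
```

Under these assumptions, a lookup that returns `None` would tell a controller module that the function is not callable for the entity's current lifecycle stage.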

Operational entity manager engine 430, in various embodiments, is a set of software routines executable to manage operational entities 110. Note that a controller module 120 may be considered an operational entity 110 in various cases, as it may be associated with a definition 310 and a blueprint 210. As such, manager engine 430 may manage controller modules 120 as well. To manage operational entities 110 and controller modules 120, in various embodiments, manager engine 430 includes various modules, such as a scheduler module, a sweeper module, a health assessment module, and an investigator module.

The scheduler module, in various embodiments, is a set of functionality that determines when to make changes to operational entities 110 that are being managed by controller module 120. In various embodiments, the scheduler module causes actions to be performed by scheduling them to be performed by other components of manager engine 430. The scheduler module may be declarative (e.g., “this operational entity 110 should be in this intended state”) or imperative (e.g., “create a snapshot of a DB”). To schedule actions, the scheduler module may write requested actions (e.g., commands from a user) with scheduled times to the local storage and/or database 130. The scheduler module may also write the progress and outcomes of scheduled actions to the local storage and/or database 130. Such information may be written to database 130 so that if controller module 120 crashes, a new instance of controller module 120 may pick up where the previous instance left off. In various cases, the scheduler module may schedule the times at which the sweeper module probes operational entities 110.
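The crash-recovery behavior described above can be sketched as follows. This example uses an in-memory SQLite connection as a stand-in for database 130; the `Scheduler` class, the `actions` table, and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3


class Scheduler:
    """Illustrative scheduler that records actions in a shared store."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS actions ("
            "id INTEGER PRIMARY KEY, command TEXT, run_at REAL, status TEXT)"
        )

    def schedule(self, command: str, run_at: float) -> int:
        # Write the requested action together with its scheduled time.
        cur = self.conn.execute(
            "INSERT INTO actions (command, run_at, status) VALUES (?, ?, 'pending')",
            (command, run_at),
        )
        self.conn.commit()
        return cur.lastrowid

    def record_outcome(self, action_id: int, status: str) -> None:
        # Write the outcome so that a later instance can see it.
        self.conn.execute(
            "UPDATE actions SET status = ? WHERE id = ?", (status, action_id)
        )
        self.conn.commit()

    def pending(self, now: float):
        rows = self.conn.execute(
            "SELECT id, command FROM actions "
            "WHERE status = 'pending' AND run_at <= ? ORDER BY run_at",
            (now,),
        )
        return rows.fetchall()


database = sqlite3.connect(":memory:")  # stand-in for database 130
first = Scheduler(database)
snapshot_id = first.schedule("create a snapshot of a DB", run_at=10.0)
probe_id = first.schedule("probe operational entities", run_at=20.0)
first.record_outcome(snapshot_id, "done")
del first  # simulate the controller module crashing

replacement = Scheduler(database)  # new instance reads the same store
remaining = replacement.pending(now=30.0)
```

Because the outcome of the first action was recorded before the simulated crash, the replacement instance sees only the unfinished action.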

The sweeper module, in various embodiments, is a set of functionality that probes the operational entities 110 that are being managed to collect information about the health of those operational entities (e.g., resource utilization versus capacity, major health indicators, etc.). In some embodiments, the sweeper module reads operational entity information 410 from a local storage (which might be persistent) or database 130. From operational entity information 410, the sweeper module may learn about the operational entities 110 that are being managed by its controller module 120 and how to connect to them. The sweeper module may then probe those operational entities. In some embodiments, the sweeper module sends a status request message to each of the listed operational entities 110 that requests information detailing the current state of that operational entity 110. Instead of the sweeper module initially sending the status request message, in some embodiments, the operational entities 110 may periodically send information to the sweeper module that indicates their current state. In various embodiments, the sweeper module stores the information received from operational entities 110 as a part of operational entity information 410. Such information may indicate resource utilization, resource capacity, the status of that operational entity 110 (e.g., offline, online, etc.), the status of that operational entity's relationships with other operational entities 110 (e.g., the other operational entity 110 is not responding), etc. The sweeper module may store any alerts that may have been triggered as “incidents” to be investigated. As an example, if the sweeper module does not hear from an operational entity 110, then it may store an indication that the operational entity might not be healthy. Information that is no longer operationally relevant (e.g., old, irrelevant records) may be removed from the local storage and/or database 130.

The health assessment module, in various embodiments, is a set of functionality that assesses the health of the managed operational entities 110 using the information obtained by the sweeper module. In various embodiments, the health assessment module reads operational entity information 410 from the local storage or database 130. The health assessment module may then determine, based on operational entity information 410, whether to create a report to trigger the investigator module to investigate an operational entity 110. For example, the health assessment module may assess the resource utilization of an operational entity 110 and, if the resource utilization is too high or low relative to what it ought to be, then the health assessment module may create a report for that operational entity 110. In various embodiments, the health assessment module may attempt to predict further events based on historical operational entity information 410. For example, if an operational entity 110 shows signs of following a certain trend that ends with the operational entity 110 failing, then the health assessment module may create a report to preemptively have that operational entity investigated.

The investigator module, in various embodiments, is a set of functionality that inspects operational entities 110. The investigator module may check for reports that have not yet been investigated. For each report, the investigator module may collect information that pertains to the relevant operational entities 110; such information may be leveraged by other components of system 100 or users to troubleshoot any issues. For example, the investigator module might collect log information detailing operations performed by an operational entity 110 prior to the operational entity 110 failing. In various cases, the investigator module may have access to relationship information 330 for whatever entity is not healthy. For example, the investigator module may have access to relationship information 330 for a database server entity 110 that is not healthy and depends on another service entity 110 that also might not be healthy. The fact that the database server entity 110 is not healthy may only be a symptom, and the place to investigate may really be the service that it depends on. Accordingly, the investigator module may use the control API to drill down into the health of that service to determine if it is causing problems for the database server entity 110. In some embodiments, the investigator module may attempt to troubleshoot any issues that it discovers. For example, the investigator module may issue a set of commands to restart/reinitialize an operational entity 110 that has crashed. The investigator module may update a report's state and ownership after the automated investigation is complete (e.g., if the automated investigation failed to resolve the issue, the ownership may be transferred to a user).

Accordingly, in various embodiments, the sweeper module gathers health information about the operational entities 110 that are managed by its controller module 120. That health information may be assessed by the health assessment module to determine if there are issues with those operational entities 110. If there are potential issues, then the issues may be reported to the investigator module for further analysis.
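The sweep, assess, and investigate flow summarized above might be sketched as follows. The function names, the utilization thresholds, and the report fields are assumptions introduced for this illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Probe:
    entity_uuid: str
    utilization: Optional[float]  # None models an entity that did not respond


@dataclass
class Report:
    entity_uuid: str
    reason: str
    state: str = "open"
    owner: str = "investigator"


def sweep(entities: Dict[str, Optional[float]]) -> List[Probe]:
    # Sweeper: collect the current health information for each entity.
    return [Probe(uuid, util) for uuid, util in entities.items()]


def assess(probes: List[Probe], low: float = 0.05,
           high: float = 0.90) -> List[Report]:
    # Health assessment: create a report when something looks wrong.
    reports = []
    for probe in probes:
        if probe.utilization is None:
            reports.append(Report(probe.entity_uuid, "no response"))
        elif not low <= probe.utilization <= high:
            reports.append(Report(probe.entity_uuid, "utilization out of range"))
    return reports


def investigate(report: Report, restart: Callable[[str], bool]) -> Report:
    # Investigator: attempt an automated fix, else hand the report to a user.
    if restart(report.entity_uuid):
        report.state = "resolved"
    else:
        report.state, report.owner = "escalated", "user"
    return report


entities = {"db-1": 0.97, "svc-1": None, "cache-1": 0.40}
reports = [
    investigate(r, restart=lambda uuid: uuid == "svc-1")
    for r in assess(sweep(entities))
]
```

In this sketch the healthy entity produces no report, the unresponsive entity is restarted successfully, and the overloaded entity is escalated to a user.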

In some embodiments, operational entity manager engine 430 may receive instructions pertaining to the management of operational entities 110 (in some cases, those under manager engine 430). For instructions pertaining to operational entities 110 not under manager engine 430, manager engine 430 may route the instructions to the appropriate controller modules 120 for processing. In some cases, manager engine 430 may route instructions based on information included in those instructions, such as a UUID 325 value of a corresponding operational entity 110. For example, manager engine 430 may determine, based on a UUID 325 value, a certain controller module 120 that manages the operational entity 110 that corresponds to that UUID 325 value. Accordingly, manager engine 430 may route the corresponding instruction to that controller module 120 for processing.

For instructions that pertain to operational entities 110 that are under the management of manager engine 430, manager engine 430 may process the instructions, which may include changing states of one or more operational entities 110. In various cases, manager engine 430 may access control API information 420 in order to determine which functions of the control API are available for invoking. An instruction may identify a corresponding operational entity 110 and a function to be performed with respect to that operational entity. Accordingly, if the appropriate control API function has been implemented by the operational entity 110 (as may be determined from control API information 420), then manager engine 430 may send a control API call to the operational entity 110 to execute that control API function. In some instances, manager engine 430 may invoke a control API function implementation of another operational entity 110 in order to make a change to the original operational entity 110. In issuing a control API call, manager engine 430 may carry out the received instruction.

Workflow engine 440, in various embodiments, is a set of software routines executable to implement workflows. As noted, an operational scenario might be described in a workflow that includes a set of commands for implementing the sequence of steps of that operational scenario. The commands may identify operations (e.g., state changes) to be performed on certain operational entities 110. In various embodiments, workflow engine 440 implements a workflow by issuing instructions to operational entities 110 and/or controller module 120 to carry out the commands of the workflow. As an example, workflow engine 440 may issue an instruction to a controller module 120 to change the state of an operational entity 110 (managed by the controller module 120) from “offline” to “online.” In various cases, workflow engine 440 may obtain workflows from database 130; in some cases, workflow engine 440 may obtain workflows from reasoning engine 450.

Reasoning engine 450, in various embodiments, is a set of software routines executable to generate a workflow based on a high-level goal. Reasoning engine 450 may initially receive a request from a user to implement a particular high-level goal. For example, a user may request that a database server entity 110 be upgraded from one version to another version. Reasoning engine 450, in various embodiments, “reasons” about the requested high-level goal in order to generate a workflow having a set of commands that implement the goal. In some instances, the output of reasoning engine 450 (e.g., a workflow) may be provided to workflow engine 440 to implement the high-level goal via the output. Reasoning engine 450 may greatly reduce the amount of specific operational code that has to be written by developers of system 100. Reasoning engine 450 is discussed in more detail with respect to FIG. 17.

Turning now to FIG. 5, a flow diagram of a method 500 is shown. Method 500 is one embodiment of a method performed by a computer system (e.g., system 100) for managing an operational scenario (e.g., in operational information 135) for a target computer environment (e.g., target environment 137). Method 500 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some cases, method 500 may be performed in response to the computer system receiving a request from a user. In some embodiments, method 500 may include additional steps. As an example, the computer system may access definitions (e.g., definitions 310) to validate blueprints (e.g., blueprints 210).

Method 500 begins in step 510 with the computer system accessing operational information (e.g., operational information 135) defining a set of commands for the operational scenario. The operational scenario may include changing states of one or more software entities included in a set of operational entities to transition the one or more software entities from a first software version to a second software version.

In step 520, the computer system accesses blueprints (e.g., blueprints 210) for the set of operational entities (e.g., operational entities 110) that are to be utilized in the target computer environment for implementing the operational scenario. A given blueprint might indicate, for a first one of the set of operational entities, a set of relationships (e.g., relationships 331) between the first operational entity and one or more other operational entities of the set of operational entities. The set of operational entities may include a hardware entity (e.g., a set of processors), a software entity (e.g., a database server that executes on at least one of the set of processors), and an information entity (e.g., a logical database that is managed by the database server).

In step 530, the computer system implements the operational scenario for the target computer environment. In various cases, implementing the operational scenario may include executing a hierarchy of controller modules (e.g., controller modules 120) that may include an orchestrator controller module at a top level of the hierarchy that is executable to carry out the set of commands by issuing instructions to controller modules at a next level of the hierarchy. In various cases, the hierarchy of controller modules may include controller modules that are executable to manage the set of operational entities according to respective blueprints in order to complete the operational scenario, including by changing states of one or more of the set of operational entities. In some cases, a first operational entity of the set of operational entities may be at a different level within the hierarchy than a second operational entity of the set of operational entities. Accordingly, ones of the controller modules that are executable to manage the set of operational entities may be at different levels of the hierarchy.

In some embodiments, a given operational entity implements one or more of a set of functions (e.g., control API implementation 230) that are supported by a control application programming interface (API). The one or more implemented functions may allow a controller module to change a state of the given operational entity. In various cases, a particular one of the blueprints may be associated with the given operational entity and may specify a lifecycle value (e.g., a value for lifecycle 322) indicative of a current lifecycle stage (e.g., specification stage) associated with the given operational entity. The lifecycle value may be usable by a controller module for determining which of the one or more implemented functions are callable for the lifecycle stage.

In some embodiments, the given operational entity is associated with a unique identifier (e.g., a value for UUID 325) that uniquely identifies that given operational entity. A particular one of the instructions may be associated with the given operational entity. In some instances, issuing the particular instruction might include determining, based on that unique identifier, a particular one of the controller modules that manages that given operational entity and issuing the particular instruction to the particular controller module. The particular instruction may include causing the software entity (e.g., a database server) to instantiate another information entity (e.g., a logical database).

In some embodiments, the set of relationships specified for the first operational entity affects an order in which ones of the set of commands can be carried out. In some cases, the set of relationships may include a relationship between the first operational entity and a second one of the set of operational entities. As such, performing a particular one of the instructions to change a state of the first operational entity may include changing a state of the second operational entity prior to changing the state of the first operational entity. In some cases, the relationship between the first operational entity and the second operational entity may be a dependence relationship in which the first operational entity depends on existence of the second operational entity.
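One way to picture how dependence relationships can constrain ordering is a topological sort over the "depends on" edges. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the entity names and the `change_order` helper are hypothetical, and the disclosure does not state that a topological sort is the technique actually used.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def change_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which each entity follows the entities it depends on.

    `depends_on` maps an entity to the set of entities it depends on,
    loosely mirroring dependence relationships recorded in blueprints.
    """
    return list(TopologicalSorter(depends_on).static_order())


# A logical database depends on its database server, which in turn
# depends on the processors that execute it.
relationships = {
    "logical_db": {"db_server"},
    "db_server": {"processors"},
}
order = change_order(relationships)
```

With this chain of dependencies, the state of the processors would be changed first and the state of the logical database last.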

Turning now to FIG. 6, a flow diagram of a method 600 is shown. Method 600 is one embodiment of a method performed by a computer system (e.g., system 100) for managing a set of operational entities (e.g., operational entities 110). Method 600 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 600 may include additional steps. As an example, the computer system may maintain a database (e.g., database 130) that stores operational information (e.g., operational information 135).

Method 600 begins in step 610 with the computer system executing a hierarchy of controller modules (e.g., controller modules 120) having an orchestrator controller module at a top level of the hierarchy that is operable to communicate with other controller modules in the hierarchy to manage the set of operational entities. In various cases, the set of operational entities may include a hardware entity, a software entity, and an information entity.

In step 620, the orchestrator controller module accesses operational information (e.g., operational information 135) that specifies a workflow having commands for implementing a sequence of steps of an operational scenario involving the set of operational entities.

In step 630, the orchestrator controller module implements the commands of the workflow to implement the operational scenario. In some cases, implementing the commands may include issuing instructions to one or more of the other controller modules to change states of one or more of the set of operational entities. Particular controller modules within the hierarchy may be operable to manage the set of operational entities according to respective blueprints (e.g., blueprints 210) that define attributes (e.g., UUT 321, lifecycle 322, version 323, etc.) of the set of operational entities. In some embodiments, the attributes include relationship attributes (e.g., relationship type 332, direction 333, etc.) defining information that pertains to relationships (e.g., relationships 331) between ones of the set of operational entities. These relationships may affect an order in which the states of the one or more operational entities are to be changed based on the instructions.

In some cases, the one or more other controller modules may be operable to route one or more of the instructions received from the orchestrator controller module to the particular controller modules that manage the set of operational entities. In some cases, the one or more other controller modules may be operable to route the one or more instructions based on a set of unique identifiers (e.g., UUIDs 325) corresponding to the set of operational entities.

Turning now to FIG. 7, a block diagram of a control API 700 is shown. In the illustrated embodiment, control API 700 includes a describe function 710, a fetch function 720, a transition function 730, a create function 740, a destroy function 750, a perturb function 760, a validate function 770, and an analyze function 780. In various embodiments, control API 700 may include more functions than shown. For example, control API 700 may include a log function that enables a controller module 120 to retrieve log information maintained by an operational entity 110.

Control API 700, in various embodiments, is a collection of functions or API calls that are invokable to access information from an operational entity 110, make a change to an aspect of the operational entity 110 (e.g., transition that entity to another state), and/or make a change to another operational entity 110 managed by the operational entity 110. Control API 700 may include a selected set of functions that are common across all operational entities 110 in system 100, but the functionality of which is individually defined for each operational entity 110. For example, control API implementations 230A and 230B each might define create function 740 to create an operational entity 110, but the particular type of operational entity 110 created by control API implementation 230A may differ from the particular type created by control API implementation 230B. In various embodiments, an operational entity 110 may support multiple different implementations of the same type of function. As an example, a database server entity 110 may support two implementations of create function 740: one to create a logical database entity 110 and another to create a backup image entity 110. In some cases, an operational entity 110 may not support all functions provided by control API 700. For example, perturb function 760 may not be implemented for a logical database entity 110.
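The idea of a common function set with per-entity implementations can be sketched with a small class hierarchy. The class names, the `CONTROL_API_FUNCTIONS` tuple, and the return values below are assumptions for illustration only; they are not the disclosed control API implementation 230.

```python
from typing import Dict, List

# The eight function names shown for control API 700 in FIG. 7.
CONTROL_API_FUNCTIONS = (
    "describe", "fetch", "transition", "create",
    "destroy", "perturb", "validate", "analyze",
)


class ControlAPIImplementation:
    """Hypothetical base class: every entity gets `describe`, and may
    define any subset of the remaining functions."""

    def describe(self) -> List[str]:
        # Report which control API functions this entity implements.
        return [name for name in CONTROL_API_FUNCTIONS
                if callable(getattr(self, name, None))]


class DatabaseServerControl(ControlAPIImplementation):
    def create(self, kind: str, name: str) -> str:
        # One function name, two behaviors, selected by an input.
        if kind not in ("logical_database", "backup_image"):
            raise ValueError(f"cannot create entity of kind {kind!r}")
        return f"{kind}:{name}"


class LogicalDatabaseControl(ControlAPIImplementation):
    # No `perturb` here: not every entity supports every function.
    def fetch(self, names: List[str]) -> Dict[str, str]:
        return {name: "unknown" for name in names}


server_functions = DatabaseServerControl().describe()
logical_db_functions = LogicalDatabaseControl().describe()
created = DatabaseServerControl().create("backup_image", "nightly")
```

A controller module in this sketch would call `describe` first and then invoke only the functions that appear in the returned list.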

Describe function 710, in various embodiments, returns information pertaining to an operational entity 110. The information may include an operational entity 110's blueprint 210, definition 310, and/or information pertaining to its control API implementation 230. Consider an example in which a controller module 120 wishes to discover which operational entities 110 a particular operational entity 110 depends upon. That controller module 120 may invoke the describe function 710 of that particular operational entity 110 to receive its blueprint 210, which may identify the relationships 331 of the particular operational entity 110. In some cases, describe function 710 may be called to determine which of the other functions (e.g., functions 720, 730, etc.) have been implemented in an operational entity's control API implementation 230. Accordingly, in some embodiments, each operational entity 110 defines describe function 710 so that controller modules 120 may have a guaranteed way of learning about those operational entities 110.

Fetch function 720, in various embodiments, fetches one or more variables 340 for an operational entity 110. As an example, a controller module 120 may invoke the fetch function 720 of an operational entity 110 to access a particular variable 340 that is indicative of whether that operational entity is “online” or “offline.” When invoking fetch function 720, a controller module 120 may specify the requested variables 340 as inputs into fetch function 720. In some cases, a controller module 120 may invoke describe function 710 to determine what variables 340 may be requested from a specific operational entity 110 via its control API implementation 230. In some embodiments, the information returned by fetch function 720 may indicate certain properties of the returned variables 340. Such properties might include the name of a variable 340, its value, the minimum and maximum possible values for that variable, a data type (e.g., bool, integer, string, float, etc.), an information type (e.g., counter, rate, etc.), a unit type (e.g., seconds, kilobytes, etc.), and flags (e.g., mutable, canonical, etc.). For example, fetch function 720 may return information that specifies a variable 340 having a name of “status,” a value of “online,” and a flag of “mutable.”
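A fetch call that returns a variable along with some of its properties might look like the following sketch. The `Variable` dataclass carries only a subset of the properties listed above, and the `ExampleEntityControl` class and the `uptime` variable are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Variable:
    # A subset of the properties that fetch might return for a variable 340.
    name: str
    value: object
    data_type: str
    flags: Tuple[str, ...] = ()


class ExampleEntityControl:
    """Hypothetical control API implementation exposing two variables."""

    def __init__(self) -> None:
        self._variables: Dict[str, Variable] = {
            "status": Variable("status", "online", "string", ("mutable",)),
            "uptime": Variable("uptime", 3600, "integer"),
        }

    def describe(self) -> List[str]:
        # Lets a controller module learn which variables can be requested.
        return sorted(self._variables)

    def fetch(self, names: List[str]) -> List[Variable]:
        # The requested variable names are passed as inputs.
        return [self._variables[name] for name in names]


entity = ExampleEntityControl()
available = entity.describe()
(status,) = entity.fetch(["status"])
```

Here the fetched `status` variable has the name, value, and flag used in the example above: “status,” “online,” and “mutable.”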

Transition function 730, in various embodiments, transitions or changes one or more variables 340 (or other information such as the values included in entity descriptor 320 and/or relationship information 330) for an operational entity 110 from a first value to a second value. As an example, the transition function 730 of a database server entity 110 might be invoked to change a status variable 340 from “offline” to “online.” The control API implementation 230 of that database server entity 110 may invoke software routines that cause the database server entity 110 to transition from offline to online. The control API implementation 230 may then update the status variable 340 from “offline” to “online.” That control API implementation 230 may hide the underlying complexity of transitioning the database server to an online state from the controller module 120 that invokes transition function 730. That is, from the perspective of a controller module 120, invoking transition function 730 may change a variable 340, while control API implementation 230 may actually implement the changes signified by the change in that variable. As such, in various embodiments, transition function 730 enables a controller module 120 to transition an operational entity 110 from a first state to a second state.
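The separation between the variable change seen by a controller module and the work performed by the control API implementation can be sketched as follows. The `ExampleServerControl` class, its private routines, and the recorded step names are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple


class ExampleServerControl:
    """Hypothetical control API implementation for a database server."""

    def __init__(self) -> None:
        self.variables: Dict[str, str] = {"status": "offline"}
        self.steps_run: List[str] = []

    def _bring_online(self) -> None:
        # Placeholder for the routines that actually start the server.
        self.steps_run += ["start server process", "accept connections"]

    def _take_offline(self) -> None:
        self.steps_run += ["drain connections", "stop server process"]

    def transition(self, name: str, new_value: str) -> None:
        routines: Dict[Tuple[str, str, str], Callable[[], None]] = {
            ("status", "offline", "online"): self._bring_online,
            ("status", "online", "offline"): self._take_offline,
        }
        current = self.variables[name]
        routine = routines.get((name, current, new_value))
        if routine is None:
            raise ValueError(
                f"unsupported transition of {name}: {current} -> {new_value}")
        routine()                           # do the underlying work first
        self.variables[name] = new_value    # then update the variable


server = ExampleServerControl()
server.transition("status", "online")
```

From the caller's side only the `status` variable changes; the steps needed to bring the server online stay inside the implementation.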

Other examples of using transition function 730 may include updating an operational entity 110 to a new version, disabling an operational entity 110 from creating other operational entities 110, changing a configuration specification of an operational entity 110, shutting down an operational entity 110, etc. In various embodiments, implementing an operational scenario may involve issuing multiple transition function 730 calls to multiple operational entities 110 to change their states.

Create function 740, in various embodiments, causes an operational entity 110 to create another operational entity 110. For example, a database server entity 110 may implement create function 740 to create logical database entities 110. As a result, a controller module 120 may invoke that create function 740 to create a logical database entity 110 if desired. In various embodiments, an operational entity 110 might control other operational entities 110 in that it may carry out actions on those operational entities or on behalf of those operational entities. In various cases, an operational entity 110 that controls another operational entity 110 may have the ability to create, destroy, list, and/or describe that other operational entity. In some cases, an operational entity 110 might create another operational entity 110 by cloning an operational entity 110 that is the same as that other operational entity. In some cases, an operational entity 110 might create another operational entity 110 by transitioning that other operational entity along its lifecycle stages. For example, an operational entity 110 might create a snapshot of a database persistence by transitioning the database persistence from an instance stage to a snapshot stage. In various embodiments, create function 740 receives source information as input that identifies a base (e.g., a disk image file) upon which to create a new operational entity 110.

Destroy function 750, in various embodiments, destroys/removes an operational entity 110 from system 100. In some cases, a controller module 120 may invoke the destroy function 750 of a particular operational entity 110 to destroy that operational entity in response to it no longer being needed in system 100. For example, a database server entity 110 that was created when there was high demand traffic may be destroyed when there is less traffic. In some cases, the destroy function 750 of an operational entity 110 may be invoked if that operational entity is malfunctioning. A controller module 120 might then instantiate another of the same type of operational entity 110.

Perturb function 760, in various embodiments, perturbs an operational entity 110 by causing that operational entity to behave anomalously. For example, the perturb function 760 of a particular operational entity 110 may inject faults into that operational entity. Such faults might include, for example, causing the particular operational entity 110 to crash, hang, or shut down. As discussed in more detail with respect to FIG. 24, perturb function 760 may be helpful in testing system 100 by causing issues in system 100 in order to see if system 100 can recover from those issues.

Validate function 770, in various embodiments, validates an operational entity 110 for correctness. For example, a controller module 120 may invoke the validate function 770 of an operational entity 110 to determine if the configuration and/or environment of that operational entity is correct (e.g., to determine if the configuration is using the appropriate values). Analyze function 780, in various embodiments, gathers metric information for an operational entity 110. As an example, a controller module 120 may invoke the analyze function 780 of an operational entity 110 to obtain an error log from that operational entity 110.

Turning now to FIG. 8, a block diagram of a routing engine 810 and routable entities 830 is shown. In the illustrated embodiment, controller module 120 includes operational entity manager engine 430 having routing engine 810. As further illustrated, routable entity 830A is located locally with respect to controller module 120 (e.g., located on the same network) while routable entity 830B is located remotely with respect to controller module 120.

As previously noted, operational entities 110 and controller modules 120 may form a hierarchy having an orchestrator controller module 120 at the top level. When implementing an operational scenario, the orchestrator controller module 120 may issue instructions through the hierarchy to controller modules 120 that manage operational entities 110. Such controller modules 120 may carry out the received instructions by issuing control API calls to invoke the appropriate functions of control API 700 that are implemented by those managed operational entities 110. By invoking the functions of control API 700, controller modules 120 may change the states of the operational entities 110.

Instructions may be received and/or accessed from various sources. In some cases, an instruction may be initially received from a command line tool that translates a human-readable command (entered by a user or ad hoc script) into an instruction that is understood by controller modules 120. The instruction derived from a command entered into the command line may be initially received by an orchestrator controller module 120 that may propagate the instruction through the hierarchy of operational entities 110 and controller modules 120. In various cases, instructions may be derived from workflow information stored in database 130. Accordingly, after being instructed (e.g., via the command line tool), an orchestrator controller module 120 may access workflow information from database 130 and propagate the instructions associated with the workflow information through the hierarchy. In some cases, the workflow information may include human-readable commands that may be translated by a controller module 120 into instructions for implementing those commands.

When a controller module 120 receives an instruction, the controller module 120 may make a routing decision. If the instruction corresponds to an operational entity 110 managed by the controller module 120, then the controller module 120 may issue the appropriate control API 700 call to that operational entity 110. But if the instruction corresponds to an operational entity 110 managed by another controller module 120, then the first controller module 120 may route the instruction to that other controller module. In order to route instructions and invoke the functions of control API 700, in various embodiments, a controller module 120 includes a routing engine 810.

Routing engine 810, in various embodiments, is a set of software routines executable to route instructions and to invoke the functions of control API 700. Instructions may be routed based on information included in those instructions. Such information may include a UUT 321 of the operational entity 110, a lifecycle 322 of the operational entity 110, a UUID 325 of the operational entity 110, a UUID 325 of the controller module 120 that manages the operational entity 110, a UUID 325 of a container entity 110 that includes the operational entity 110 and the controller module 120, a variable 340 name, a source, and/or other information, which may be included in blueprints 210 and/or definitions 310, such as a name 328 of the operational entity 110. As an example, an instruction might correspond to fetch function 720 and might specify a variable 340 name (e.g., “status”) to fetch, the UUID 325 of the operational entity 110 from which to fetch that variable, and the UUID 325 of the controller module 120 that manages that operational entity and thus should invoke the fetch function 720 of that operational entity.

After receiving an instruction, in various embodiments, routing engine 810 determines whether the instruction should be routed to another controller module 120 or a certain function of control API 700 should be called. To determine if that instruction should be routed, in some embodiments, routing engine 810 determines whether the instruction specifies a UUID 325 for a controller module 120. If that instruction specifies a UUID 325 for a controller module 120, but the specified UUID 325 belongs to another controller module 120, then routing engine 810 may route the instruction. In some cases, routing engine 810 may determine that an operational entity 110 is not local to its controller module 120 if its controller module 120 does not have access to a control API implementation 230 for that operational entity. Routing engine 810 may make this determination based on a map of local control API implementations 230 that is maintained by its controller module 120. In some cases, the UUID 325 plus the control API call identified in the instruction may be used as a key into the map. If the map does not have an entry for such a key, then routing engine 810 may route the instruction.

In some instances, routing engine 810 may route that instruction by broadcasting it to each controller module 120 that its controller module 120 manages. In some instances, if the specified UUID 325 belongs to a controller module 120 managed by routing engine's controller module 120, then routing engine 810 may provide that instruction directly to that controller module. Routing engine 810 may use blueprints 210 to determine which controller module 120 manages the operational entity 110 for which the instruction is intended. An instruction might not specify a UUID 325 for a controller module 120; the instruction, however, may still specify a UUID 325 of the operational entity 110. As such, routing engine 810 may determine, based on operational entity information 410, whether its controller module 120 manages that operational entity. If its controller module 120 does not manage that operational entity, then routing engine 810 may broadcast that instruction to each controller module 120 that its controller module 120 manages. In some embodiments, routing engine 810 determines whether its controller module 120 manages the operational entity 110 by attempting to look up the operational entity's information in operational entity information 410 using a UUID 325 of that operational entity (or using other information such as UUT 321). In some embodiments, a routing table may be used that advertises the capabilities of each controller module 120 along with which operational entities 110 it manages. Accordingly, routing engine 810 may use this routing table to determine where to route an instruction.
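As one non-limiting illustration, the routing decision described above might be sketched as follows. The sketch is a simplification under stated assumptions and is not a definitive implementation: the Instruction and RoutingEngine classes, their field names, and the tuple-shaped return values are hypothetical and are not part of control API 700.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instruction:
    """Hypothetical instruction shape; field names are illustrative only."""
    call: str                              # control API call, e.g. "fetch"
    entity_uuid: str                       # UUID of the target operational entity
    controller_uuid: Optional[str] = None  # UUID of the managing controller, if given


@dataclass
class RoutingEngine:
    own_uuid: str
    # Map of local control API implementations, keyed by (entity UUID, call).
    local_impls: dict = field(default_factory=dict)
    # UUIDs of the controller modules managed by this controller module.
    children: set = field(default_factory=set)

    def decide(self, ins: Instruction) -> tuple:
        """Return ("invoke", fn), ("forward", uuid), or ("broadcast", uuids)."""
        addressed_elsewhere = (
            ins.controller_uuid is not None
            and ins.controller_uuid != self.own_uuid
        )
        key = (ins.entity_uuid, ins.call)
        # Local case: the instruction is not addressed to another controller
        # and the map holds an implementation for this entity and call.
        if not addressed_elsewhere and key in self.local_impls:
            return ("invoke", self.local_impls[key])
        # The named controller is directly managed: hand the instruction over.
        if ins.controller_uuid in self.children:
            return ("forward", ins.controller_uuid)
        # Otherwise broadcast to every managed controller module.
        return ("broadcast", sorted(self.children))


engine = RoutingEngine(
    own_uuid="ctrl-a",
    local_impls={("db-1", "fetch"): lambda variable: "online"},
    children={"ctrl-b", "ctrl-c"},
)
```

In this sketch, an instruction for entity "db-1" is handled locally, an instruction naming "ctrl-b" is forwarded directly, and an instruction for an unknown entity with no controller UUID is broadcast to both managed controller modules.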

If a received instruction corresponds to an operational entity 110 managed by routing engine's controller module 120, then routing engine 810 may check whether that operational entity implements a function of control API 700 for handling the action/operation indicated by the instruction. As discussed previously, in various embodiments, a controller module 120 may store a function map in control API information 420 that maps functions of control API 700 to an operational entity's UUT 321 and lifecycle 322. Accordingly, routing engine 810 may build a list of functions of control API 700 that have been implemented by the operational entity 110 based on the function map, UUT 321, and lifecycle 322. In cases where an instruction does not specify a UUT 321 and a lifecycle 322 for the operational entity 110, routing engine 810 may look up the operational entity's blueprint 210 using a UUID 325 that may be specified in the instruction for that operational entity. Routing engine 810 may then extract UUT 321 and lifecycle 322 from the accessed blueprint 210. If no blueprint 210 can be located, then routing engine 810 may return an error to the issuer of the instruction. In various embodiments, routing engine 810 builds the list of functions by selecting functions indicated in the function map that correspond to the UUT 321 and lifecycle 322 of the relevant operational entity 110.

After building a list of implemented functions, in various embodiments, routing engine 810 determines whether there is a function included in that list for implementing the operation requested by the instruction. For example, if the instruction identifies a transition operation for transitioning a particular variable 340 to a new value, then routing engine 810 may determine, based on the list, whether the operational entity 110 implements a transition function 730 for transitioning that particular variable. If so, then routing engine 810 may invoke that transition function; otherwise, routing engine 810 may return an error to the issuer of the instruction. In this manner, routing engine 810 may process received instructions.
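As one non-limiting illustration, the function lookup described above might be sketched as follows. The dictionary layouts, the handle helper, and the UUT and lifecycle values are hypothetical and serve only to show a function map keyed by UUT and lifecycle, with a fallback to the blueprint when the instruction omits those values.

```python
def fetch(entity: dict, variable: str):
    """Illustrative fetch function: return the current value of a variable."""
    return entity[variable]


def transition(entity: dict, variable: str, value):
    """Illustrative transition function: move a variable to a new value."""
    entity[variable] = value
    return entity[variable]


# Hypothetical function map: (UUT, lifecycle) -> implemented functions.
FUNCTION_MAP = {
    ("database-server", "production"): {"fetch": fetch, "transition": transition},
}
# Hypothetical blueprint store and entity state, both keyed by UUID.
BLUEPRINTS = {"uuid-1": {"uut": "database-server", "lifecycle": "production"}}
ENTITIES = {"uuid-1": {"status": "offline"}}


def handle(instruction: dict) -> dict:
    uut, lifecycle = instruction.get("uut"), instruction.get("lifecycle")
    if uut is None or lifecycle is None:
        # Fall back to the blueprint located by the entity's UUID.
        blueprint = BLUEPRINTS.get(instruction["uuid"])
        if blueprint is None:
            return {"error": "no blueprint found"}
        uut, lifecycle = blueprint["uut"], blueprint["lifecycle"]
    # Build the list of functions implemented for this UUT and lifecycle.
    implemented = FUNCTION_MAP.get((uut, lifecycle), {})
    function = implemented.get(instruction["operation"])
    if function is None:
        return {"error": "operation not implemented"}
    entity = ENTITIES[instruction["uuid"]]
    return {"result": function(entity, *instruction.get("args", ()))}
```

Under these assumptions, a transition instruction for "uuid-1" changes the entity's "status" variable, while an instruction naming an operation absent from the list yields an error for the issuer of the instruction.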

When routing an instruction or invoking a function of control API 700, routing engine 810 may make a call to routing layer 820. Routing layer 820, in various embodiments, is a set of software routines, hardware, or a combination thereof that is operable to route an instruction to another component (an operational entity 110 or a controller module 120) and/or invoke a function of control API 700. Routing layer 820 may receive a request from controller module 120 to send an instruction to another particular controller module 120 or to invoke a particular function implemented by a particular operational entity 110 for control API 700. Accordingly, that request may specify the instruction, a UUID 325 of a controller module 120, a UUID 325 of an operational entity 110, and/or a function call. Routing layer 820 may determine whether an instruction is to be routed or a function is to be called based on the contents of the request that is received from controller module 120. If the request specifies an instruction, then routing layer 820 may locate an appropriate controller module 120 (e.g., based on a UUID 325) and send the instruction to that controller module. If the request specifies a function, then routing layer 820 may locate the appropriate operational entity 110 (e.g., based on a UUID 325) and invoke the function implemented by that operational entity.

In various cases, routing layer 820 may have to communicate with operational entities 110 or controller modules 120 that are remote (e.g., operational entities 110 outside of the local network that is associated with routing engine 810). As used herein, an operational entity 110 or a controller module 120 is said to be “remote” to another operational entity 110 or controller module 120 if they are not within the same local network. To determine whether an operational entity 110 or a controller module 120 is local or remote, in various embodiments, routing layer 820 accesses information (e.g., a blueprint 210) for a routable entity 830 that is associated with that operational entity or controller module.

Routable entity 830, in various embodiments, is a specialized operational entity 110 that identifies whether another operational entity 110 or controller module 120 is remote from routing layer 820. In some cases, routable entity 830 may include a blueprint 210 or a definition 310 that specifies a remote host port (e.g., as a variable 340). In some embodiments, if routable entity 830 identifies a remote host port, then routing layer 820 determines that the associated operational entity 110 or controller module 120 is remote; otherwise it is local. For example, the information (e.g., blueprint 210) that is associated with routable entity 830A may indicate that operational entity 110A is local while the information associated with routable entity 830B may indicate that operational entity 110B is remote. Based on this information, routing layer 820 may select an appropriate communication protocol for communicating with the operational entity 110 or controller module 120. To access the appropriate routable entity 830, in various embodiments, routing layer 820 accesses relationship information 330 for the corresponding operational entity 110 or controller module 120. The relationship information 330 may identify a relationship 331 between the corresponding operational entity 110 and the relevant routable entity 830. For example, an operational entity 110 may be “contained” within a routable entity 830. Based on this, routing layer 820 may look up a blueprint 210 for that routable entity 830 from a local storage and/or database 130.
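As one non-limiting illustration, the local/remote determination described above might be sketched as follows. The blueprint layout, the "remote_host_port" variable name, the example address, and the protocol labels are all hypothetical. The sketch also assumes that an entity with no associated routable entity is treated as local, which the description above does not specify.

```python
from typing import Optional

# Hypothetical blueprint store keyed by UUID; all field names are illustrative.
BLUEPRINTS = {
    "entity-110A": {"relationships": [{"type": "contained", "target": "routable-830A"}]},
    "entity-110B": {"relationships": [{"type": "contained", "target": "routable-830B"}]},
    "routable-830A": {"variables": {}},
    "routable-830B": {"variables": {"remote_host_port": "198.51.100.7:8443"}},
}


def routable_blueprint(entity_uuid: str, blueprints: dict) -> Optional[dict]:
    """Follow the entity's "contained" relationship to its routable entity."""
    for relationship in blueprints[entity_uuid].get("relationships", []):
        if relationship["type"] == "contained":
            return blueprints.get(relationship["target"])
    return None


def is_remote(entity_uuid: str, blueprints: dict) -> bool:
    routable = routable_blueprint(entity_uuid, blueprints)
    if routable is None:
        return False  # Assumption: no routable entity means local.
    # A remote host port on the routable entity marks the entity as remote.
    return bool(routable.get("variables", {}).get("remote_host_port"))


def select_protocol(entity_uuid: str, blueprints: dict) -> str:
    """Pick a (hypothetically named) protocol from the determination."""
    return "remote-protocol" if is_remote(entity_uuid, blueprints) else "local-protocol"
```

Because the determination is made entirely inside these helpers, a caller passes only the entity's UUID and does not need to know whether the entity is local or remote.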

In various embodiments, a controller module 120 is agnostic to whether an operational entity 110 or controller module 120 is remote or local. That controller module 120 may instead rely on routing layer 820 to make that determination. From the point-of-view of the controller module, communicating with a local operational entity 110 and a remote operational entity 110 may be the same (it may appear as if all operational entities 110 are local). This may allow the process of communicating with an operational entity 110 to be simplified to one control API instead of using two different control APIs.

Turning now to FIG. 9, a flow diagram of a method 900 is shown. Method 900 is one embodiment of a method performed by a controller module (e.g., a controller module 120) for issuing an instruction to an operational entity (e.g., an operational entity 110) as part of an operational scenario (e.g., in operational information 135) for a target computer environment (e.g., target environment 137). Method 900 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some cases, method 900 may be performed in response to the controller module receiving an instruction from a user or another controller module. In some embodiments, method 900 may include additional steps. For example, the controller module may route an instruction to another controller module for implementation.

Method 900 begins in step 910 with the controller module performing a discovery procedure. As part of the discovery procedure, in step 912, the controller module identifies components within a hierarchy of a target computer environment that are to be controlled by the controller module. The controller module may identify components within the hierarchy by accessing operational entity information (e.g., operational entity information 410) defining unique identifiers (e.g., UUIDs 325) that correspond to components within the hierarchy that are to be controlled by the controller module. In various cases, the hierarchy may include both controller modules and operational entities.

As part of the discovery procedure, in step 914, the controller module discovers functional capabilities of the identified components. A given component may implement one or more functions of a plurality of functions (e.g., functions 710, 720, 730, etc.) supported by a control application programming interface (API) (e.g., control API 700). The one or more functions may allow for the controller module to change a state of the given component. The controller module may generate a mapping that maps a given one of the set of operational entities to a set of functions implemented by that given operational entity from the plurality of functions supported by a control API. The controller module may control a particular operational entity and another particular operational entity. In various cases, the particular operational entity may implement a different set of the plurality of functions than the other particular operational entity.

In various embodiments, discovering the functional capabilities of the components may include discovering the functional capabilities of the particular operational entity by invoking a describe function (e.g., describe function 710) that is implemented by the particular operational entity for the control API. In response to invoking the describe function, the controller module may receive a response from the particular operational entity that identifies a set of functions of the plurality of functions of the control API implemented by the particular operational entity.
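As one non-limiting illustration, the capability discovery of step 914 might be sketched as follows. The OperationalEntity class, the describe method's return format, and the discover helper are hypothetical stand-ins for an operational entity that implements a describe function and for a controller module that builds the mapping.

```python
class OperationalEntity:
    """Hypothetical entity that reports which control API functions it implements."""

    def __init__(self, uuid: str, functions: set):
        self.uuid = uuid
        self._functions = set(functions)

    def describe(self) -> list:
        # Stand-in for a describe function: list the implemented functions.
        return sorted(self._functions)


def discover(entities: list) -> dict:
    """Build a mapping from each entity's UUID to its implemented functions."""
    return {entity.uuid: set(entity.describe()) for entity in entities}


# Two entities that implement different subsets of the control API.
database = OperationalEntity("uuid-db", {"describe", "fetch", "transition"})
cache = OperationalEntity("uuid-cache", {"describe", "fetch"})
capabilities = discover([database, cache])
```

Here the resulting mapping records that the first entity implements a transition function while the second does not, which a controller module could consult before issuing a control API call.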

In step 920, the controller module implements a portion of an operational scenario for the target computer environment. The operational scenario may include updating a component identified during the discovery procedure from a first version to a second version. As part of implementing the portion of the operational scenario, in step 922, the controller module receives, from a component (e.g., another controller module 120) that controls the controller module, an instruction specifying a particular operation and a particular operational entity for performing the particular operation.

As part of implementing the portion of the operational scenario, in step 924, the controller module generates a response to the instruction using the particular operation, the particular operational entity, and the discovered functional capabilities of the identified components. Generating the response to the instruction may include the controller module identifying, from the set of functions, a particular function invokable to cause the particular operational entity to perform the particular operation. The controller module may determine, based on a lifecycle value that is indicative of a lifecycle stage, the particular function from the set of functions. In some cases, the instruction may define a unique identifier associated with the particular operational entity and thus the controller module may access, based on the unique identifier, a blueprint that corresponds to the particular operational entity. The blueprint may specify the lifecycle value. The particular function may be a transition function (e.g., transition function 730) that is invokable to transition the particular operational entity from a first state to a second state. The controller module may issue, to the particular operational entity, a control API call to invoke the particular function to perform the particular operation. The controller module may further send, to the component that controls the controller module, a message specifying a result that indicates whether the particular operation was performed successfully.

In some cases, the controller module may receive another instruction that specifies another operation and another operational entity for performing the other operation. The controller module may determine, based on the other instruction, that the other operational entity is controlled by another particular controller module. As such, the controller module may route the other instruction to the other particular controller module.

Turning now to FIG. 10, a flow diagram of a method 1000 is shown. Method 1000 is one embodiment of a method performed by a controller module (e.g., a controller module 120) for issuing an instruction to an operational entity (e.g., an operational entity 110) as part of an operational scenario (e.g., in operational information 135) for a target computer environment (e.g., target environment 137). Method 1000 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 1000 may include additional steps. For example, the controller module may route an instruction to another controller module for implementation.

Method 1000 begins in step 1010 with the controller module, within a hierarchy that includes controller modules and operational entities, receiving an instruction specifying an operation to be performed by an operational entity as part of an operational scenario. In some cases, the instruction may be received from another controller module within the hierarchy that controls the controller module.

In step 1020, the controller module discovers a set of functions (e.g., functions 710, 720, etc.) implemented by the operational entity from a plurality of functions supported by a control application programming interface (API) (e.g., control API 700) that allows for a given operational entity's state to be changed. Discovering the set of functions may include the controller module receiving, from the operational entity, a broadcast that identifies the set of functions implemented by the operational entity.

In step 1030, the controller module determines whether the set of functions includes a function invokable to cause the operational entity to perform the operation.

In step 1040, responsive to determining that the set of functions includes a particular function invokable to cause the operational entity to perform the operation, the controller module invokes the particular function. The particular function may be a destroy function invokable to cause the operational entity to be destroyed. In various cases, the controller module may send, to the other controller module that sent the instruction, a message that indicates that the instruction was implemented successfully.

Turning now to FIG. 11, a flow diagram of a method 1100 is shown. Method 1100 is one embodiment of a method performed by a controller module (e.g., a controller module 120) for issuing an instruction to an operational entity (e.g., an operational entity 110) as part of an operational scenario (e.g., in operational information 135). Method 1100 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some cases, method 1100 may be performed in response to the controller module receiving an instruction from a user or another controller module. In some embodiments, method 1100 may include additional steps. For example, the controller module may communicate with its operational entities to determine which functions (e.g., functions 710, 720, etc.) of a control API (e.g., control API 700) have been implemented by those operational entities.

Method 1100 begins in step 1110 with the controller module receiving an instruction that identifies a particular operational entity to be transitioned from a first state to a second state as part of automated implementation of an operational scenario. The controller module may be included within a hierarchy of components having controller modules and operational entities. In various cases, the hierarchy may include an orchestrator controller module at a top level of the hierarchy that is executable to implement the operational scenario by issuing instructions to controller modules at a next level of the hierarchy. Accordingly, the instruction may be received by the controller module from the orchestrator controller module as part of implementing the operational scenario. The operational scenario may include starting up a database service having a set of database servers capable of performing database transactions on behalf of users of the computer system that executes the controller module.

In step 1120, the controller module causes the instruction to be carried out for the particular operational entity by making a call to a routing layer (e.g., routing layer 820). In some cases, the call may not specify whether the particular operational entity is remote relative to a local environment of the controller module. In various embodiments, the controller module makes the same call to the routing layer independent of whether the particular operational entity is within the local environment or remote to the local environment. The call may specify a particular function that is implemented by the particular operational entity for carrying out the instruction. In some cases, the routing layer may perform the routing operation by invoking the particular function. In some cases, the call may be made to the routing layer to cause the routing layer to invoke the particular function of the particular operational entity to instantiate a database server as part of starting up the database service. In yet other cases, the routing layer may perform the routing operation by routing the instruction to another controller module that manages the particular operational entity.

In various embodiments, the routing layer is operable to make a determination on whether the particular operational entity is within the local environment or remote to the local environment. The routing layer may use the determination to perform a routing operation in relation to the particular operational entity. In some embodiments, the routing layer is operable to access a blueprint (e.g., a blueprint 210) for a routable entity (e.g., a routable entity 830) associated with the particular operational entity. The routing layer may first access a blueprint for the particular operational entity that specifies relationship information (e.g., relationship information 330) for a relationship (e.g., a relationship 331) between the particular operational entity and the routable entity. That relationship may enable the routing layer to access the blueprint for the routable entity.

In various embodiments, the routing layer determines whether the particular operational entity is remote to the local environment based on whether the blueprint for the routable entity specifies a remote host port. The routing layer may select a first routing protocol for routing the instruction to the other controller module based on the determination indicating that the particular operational entity is remote to the local environment. In various cases, the first routing protocol may be different than a second routing protocol usable to route instructions within the local environment.

Turning now to FIG. 12, a flow diagram of a method 1200 is shown. Method 1200 is one embodiment of a method performed by a computer system to implement a routing layer to route an instruction to an operational entity (e.g., an operational entity 110) as part of an operational scenario (e.g., in operational information 135). Method 1200 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 1200 may include additional steps.

Method 1200 begins in step 1210 with a routing layer receiving a request to route an instruction to a particular operational entity that is to be transitioned from a first state to a second state. The request may not specify whether the particular operational entity is remote relative to a local environment of a controller module from which the request is received.

In step 1220, the routing layer makes, based on information maintained for the particular operational entity, a determination on whether the particular operational entity is within the local environment or remote to the local environment. The information may define a blueprint (e.g., a blueprint 210) for the particular operational entity. In various cases, the blueprint may define a relationship (e.g., a relationship 331) between the particular operational entity and a routable entity that is associated with a second blueprint that indicates whether the particular operational entity is within the local environment or remote to the local environment. The routing layer may access, based on the relationship, the second blueprint and determine that the particular operational entity is remote to the local environment based on the accessed second blueprint specifying a remote host port.

In step 1230, the routing layer routes the instruction to the particular operational entity based on the determination. As part of routing the instruction, the routing layer may invoke a particular function (e.g., transition function 730) that is implemented by the particular operational entity for transitioning the particular operational entity from the first state to the second state. In some cases, as part of routing the instruction, the routing layer may send the instruction to another controller module within a next level of a hierarchy of controllers relative to the controller module from which the request is received. This other controller module may directly manage the particular operational entity.

Turning now to FIG. 13, a flow diagram of a method 1300 is shown. Method 1300 is one embodiment of a method performed for issuing an instruction to an operational entity (e.g., an operational entity 110) as part of an operational scenario (e.g., in operational information 135). Method 1300 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 1300 may include additional steps. As an example, a controller module may communicate with the operational entities under its authority to determine which functions (e.g., functions 710, 720, etc.) of a control API (e.g., control API 700) have been implemented by those operational entities.

Method 1300 begins in step 1310 with a computer system (e.g., system 100) executing a hierarchy of components that include controller modules and operational entities. In various cases, the hierarchy may include an orchestrator controller module at a top level of the hierarchy that is executable to perform an operational scenario by issuing a set of instructions to controller modules at a next level of the hierarchy.

In step 1320, a controller module of the hierarchy receives an instruction that identifies a particular one of the operational entities that is to be transitioned from a first state to a second state.

In step 1330, the controller module causes the instruction to be carried out by making a call to a routing layer (e.g., routing layer 820). The call may not specify whether the particular operational entity is remote relative to a local environment of the controller module. In some embodiments, the controller module makes the same call to the routing layer independent of whether the particular operational entity is remote relative to the local environment of the controller module.

The routing layer may be operable to make a determination on whether the particular operational entity is within the local environment or remote to the local environment. The routing layer may use the determination to route the instruction to the particular operational entity. In various cases, the routing layer may determine that the particular operational entity is remote to the local environment in response to the particular operational entity being associated with a remote host port. The routing layer may utilize a first routing protocol for routing instructions to operational entities that are remote to the local environment and a second, different routing protocol for routing instructions to operational entities that are within the local environment. In some cases, routing the instruction may include routing the instruction to another controller module within a next level of the hierarchy that directly manages the particular operational entity.

Turning now to FIG. 14, a block diagram of a workflow engine 440 is shown. In the illustrated embodiment, workflow engine 440 includes a workflow process engine 1420 and a reverse engine 1440. As further shown, database 130 includes workflows 1410 and workflow state information 1430, which can be stored at an operational entity 110 as illustrated. In some embodiments, workflow engine 440 may be implemented differently than shown. For example, workflow engine 440 may include workflow state information 1430.

As noted earlier, operational scenarios may be implemented using a set of commands that correspond to a sequence of steps that perform some intended goal (e.g., updating a set of operational entities 110 to a new version). A workflow 1410, in various embodiments, specifies an ordered set of commands that correspond to a specific operational scenario. Accordingly, implementing the set of commands of a workflow 1410 may result in the associated operational scenario being carried out. In some cases, workflows 1410 may be provided by users and stored in database 130; workflows 1410 may also be provided by reasoning engine 450 as discussed in greater detail with respect to FIG. 17. A controller module 120 may access a workflow 1410 (e.g., from database 130) in response to a workflow request 1405.

Workflow request 1405, in various embodiments, is a request that instructs workflow engine 440 to implement a specified workflow 1410. Workflow request 1405 may identify a name or identifier that permits a controller module 120 to access the corresponding workflow 1410. Workflow request 1405 may be received from a user via a command line tool and/or from another controller module 120. For example, an orchestrator controller module 120 may receive workflow request 1405 from a user. In some cases, implementing a workflow 1410 might involve implementing other, different workflows 1410. In some embodiments, workflows 1410 may be stacked to form a hierarchy of workflows in which a top level workflow 1410 performs a high-level task and lower level workflows 1410 each perform a subtask of that high-level task. Continuing the previous example, implementing the particular workflow 1410 specified in the received workflow request 1405 may involve the orchestrator controller module 120 causing lower level controller modules 120 to implement a set of workflows 1410 that corresponds to the particular workflow 1410. In order to implement a given workflow 1410, in various embodiments, a controller module 120 includes a workflow process engine 1420 and a reverse engine 1440.

Workflow process engine 1420, in various embodiments, is a set of software routines executable to implement the ordered set of commands specified in a workflow 1410. Workflow process engine 1420 may implement a set of commands by issuing instructions to components within system 100. As noted previously, in some embodiments, a command may be either in a human-readable format or in a format understandable by operational entities 110 and controller modules 120. As a result, an instruction issued by workflow process engine 1420 might be the actual corresponding command or a translation of the command into a format understandable by operational entities 110 and controller modules 120. Workflow process engine 1420 may issue instructions in the manners discussed earlier (e.g., by interacting with routing engine 810 to make calls to routing layer 820).

When implementing an ordered set of commands, in various embodiments, workflow process engine 1420 maintains workflow state information 1430. Workflow state information 1430, in various embodiments, specifies a current state of an implementation of a workflow 1410 and/or a current state of a target environment 137. For example, workflow state information 1430 may identify commands of a workflow 1410 that have already been implemented. Accordingly, in response to a command being completed, workflow process engine 1420 may update workflow state information 1430 to reflect that completed command. Workflow state information 1430 may identify the state of a target environment 137 by identifying the states of the operational entities 110 and controller modules 120 within that target environment. For example, workflow state information 1430 may indicate which operational entities 110 are “online” and which are “offline.” Workflow state information 1430 may also indicate whether the workflow is running forward or in reverse. As discussed below, in response to an error occurring in implementing a workflow 1410, reverse engine 1440 may use workflow state information 1430 to respond to the error.

Reverse engine 1440, in various embodiments, is a set of software routines executable to reverse the state of system 100 back to an initial state existing before a workflow 1410 was started. In some cases, an error may occur while implementing a workflow 1410. For example, a command may fail to complete every time that workflow engine 440 attempts to implement it. As another example, workflow engine 440 (and its controller module 120) may crash, hang, or experience another type of malfunction. If an error occurs while implementing a workflow 1410, in various cases, workflow engine 440 may reattempt the relevant step by implementing the corresponding commands again. In some cases, however, reverse engine 1440 may attempt to reverse the state of system 100 back to the initial state associated with the workflow 1410.

In order to reverse the state of system 100, reverse engine 1440 may traverse the set of commands in a backwards order. In various embodiments, the set of commands specified in a workflow 1410 can be traversed in a forward order to transition a target environment 137 to an intended state from an initial state and traversed in a backwards order to transition the target environment 137 to the initial state from a current state (e.g., the intended state). By traversing the commands in a backwards order, reverse engine 1440 may get back to a known state instead of leaving the system in a broken or unknown state. Accordingly, in response to an error (e.g., a command cannot be completed), reverse engine 1440 may walk backwards through those commands that have already been implemented, undoing the one or more state changes caused by those commands. For example, if a particular command caused an operational entity 110 to transition from “offline” to “online,” then reverse engine 1440 may cause that operational entity 110 to transition back to “offline” (e.g., by invoking a function of control API 700 or issuing an instruction to a controller module 120 managing the operational entity 110). In some cases, a workflow command cannot be reversed; this may be indicated by metadata associated with the command. In such a case, workflow engine 440 may stop and alert a user to the issue.
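As one non-limiting illustration of the backwards traversal described above, the following Python sketch pairs each command with an undo action and walks back through the completed commands when a later command fails. The names Command, run_workflow, reverse, and IrreversibleCommandError are hypothetical and are not part of system 100; the target environment is modeled, purely as an assumption, by a dictionary of entity states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Env = Dict[str, str]  # assumed model of a target environment: entity -> state


@dataclass
class Command:
    name: str
    apply: Callable[[Env], None]  # forward action
    undo: Callable[[Env], None]   # action that reverses the forward action
    reversible: bool = True       # metadata indicating whether undo is allowed


class IrreversibleCommandError(Exception):
    """Raised so that a user can be alerted when reversal cannot continue."""


def reverse(completed: List[Command], env: Env) -> None:
    # Walk backwards through the commands that have already been implemented.
    for cmd in reversed(completed):
        if not cmd.reversible:
            raise IrreversibleCommandError(cmd.name)
        cmd.undo(env)


def run_workflow(commands: List[Command], env: Env) -> bool:
    completed: List[Command] = []
    for cmd in commands:
        try:
            cmd.apply(env)
        except Exception:
            reverse(completed, env)  # return to the initial state
            return False
        completed.append(cmd)
    return True


def set_state(entity: str, value: str) -> Callable[[Env], None]:
    return lambda env: env.__setitem__(entity, value)


def always_fail(env: Env) -> None:
    raise RuntimeError("command could not be completed")


env = {"db": "offline", "metrics": "offline"}
workflow = [
    Command("start-db", set_state("db", "online"), set_state("db", "offline")),
    Command("start-metrics", always_fail, set_state("metrics", "offline")),
]
succeeded = run_workflow(workflow, env)
```

In this sketch the second command fails, so the first command is undone and the dictionary is left in its initial, known state rather than a partially started one.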

In various cases, a controller module 120 (e.g., an orchestrator controller module 120) may malfunction (e.g., crash) while implementing a workflow 1410. In such a situation, it may be desirable to resume implementation of that workflow 1410 once the controller module 120 has been restored (e.g., a new controller module 120 is instantiated). Accordingly, upon recovering or being restored, a controller module 120 may attempt to access workflow state information 1430 in order to determine if there is an in-progress implementation of a workflow 1410. The controller module 120 may subsequently resume implementation of a workflow 1410 if there is one in progress. In some cases, the controller module 120 may attempt to execute the next command in the workflow 1410; in yet other cases, the controller module 120 may reverse the already completed commands to return the target environment 137 back to an initial state. The controller module 120 may then reattempt the entire workflow 1410.

Because a controller module 120 may malfunction, in various embodiments, workflow state information 1430 is stored at a location external to the controller module 120 such that if the controller module 120 malfunctions, workflow state information 1430 is not lost. Whether state information 1430 is stored at an external location may also depend on whether an entity 110 managed by a controller module 120 has “state” and whether that controller module's life is bound to that entity. As an example, if a controller module 120 is within the same container as a stateless application, it may not store workflow state information 1430 externally, but may store it in a local memory. If a problem caused the container to exit, both that entity 110 and that controller module 120 would be destroyed, and the state of the workflow would then be moot. But, if the state that is being changed persists outside of that container, then that controller module 120 may store workflow state information 1430 at a location external to the container. As shown in FIG. 14, workflow state information 1430 can be stored at an operational entity 110 and database 130. In some embodiments, if a controller module 120 manages an operational entity 110 that includes a database as an element 220, then the controller module 120 may utilize that database to store workflow state information 1430. When a controller module 120 is initiated, it may invoke the describe functions 710 of the operational entities 110 that it manages in order to learn about those operational entities. If an operational entity 110 is storing workflow state information 1430, then it may inform the controller module 120 about that information. In this manner, a controller module 120 may learn about an in-progress implementation of a workflow 1410 along with the corresponding workflow state information 1430.
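The benefit of keeping workflow state information outside the controller module can be sketched as follows. This is a minimal Python illustration under stated assumptions: a JSON file stands in for an external location such as database 130, and the names WorkflowStateStore and run_or_resume are hypothetical. State is written after every completed command, so a newly instantiated controller can skip commands that were already implemented.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


class WorkflowStateStore:
    """Hypothetical external store; a JSON file stands in for a database."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> Dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed": [], "direction": "forward"}

    def save(self, state: Dict) -> None:
        self.path.write_text(json.dumps(state))


def run_or_resume(
    commands: List[Tuple[str, Callable[[], None]]],
    store: WorkflowStateStore,
) -> List[str]:
    """Run commands in order, skipping any that the store records as done."""
    state = store.load()
    executed: List[str] = []
    for name, action in commands:
        if name in state["completed"]:
            continue  # already implemented before the malfunction
        action()
        state["completed"].append(name)
        store.save(state)  # persist after every completed command
        executed.append(name)
    return executed
```

If the process running run_or_resume crashes partway through, a replacement process calling it with the same store would continue from the first command not yet recorded as completed. A controller that instead prefers to reverse and reattempt the whole workflow could read the same record to learn which commands need undoing.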

Turning now to FIG. 15, a flow diagram of a method 1500 is shown. Method 1500 is one embodiment of a method performed by an orchestrator controller module (e.g., a controller module 120) in order to implement a workflow on a target computer environment (e.g., target environment 137). Method 1500 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 1500 may include additional steps. As an example, the orchestrator controller module may receive a request (e.g., workflow request 1405) to perform the operational scenario. In some cases, the request may specify a name value corresponding to the workflow that permits the workflow information to be accessed.

Method 1500 begins in step 1510 with the orchestrator controller moduleaccessing workflow information (e.g., operational information 135) thatdefines a workflow (e.g., a workflow 1410) having a set of commands thatcorrespond to a sequence of steps for automatically implementing anoperational scenario on a target computer environment having an initialstate and a set of components that includes controller modules andoperational entities. In some cases, the operational scenario mayinclude starting up a database service having one or more databaseservers capable of performing database transactions on behalf of usersof the computer system.

In step 1520, the orchestrator controller module implements the set ofcommands of the workflow by issuing instructions to ones of the set ofcomponents to cause the sequence of steps to be carried out.Implementing the set of commands may cause one or more state changes inthe target computer environment relative to the initial state. Invarious embodiments, the set of commands are defined such that ones ofthe set of commands can be implemented to transition the target computerenvironment from the initial state to a specified end state and reversedto transition the target computer environment from the current stateback to the initial state. The one or more state changes in the targetcomputer environment may include a state change in which a particularcomponent of the set of components instantiates a new component in thetarget computer environment. In some cases, the new component may have adifferent role in the target computer environment than the particularcomponent.

In step 1530, the orchestrator controller module maintains stateinformation (e.g., workflow state information 1430) that identifies acurrent state of the target computer environment that permits theorchestrator controller module to respond to an error in implementingthe set of commands. In response to detecting that a particular step ofthe sequence of steps failed to be carried out, the orchestratorcontroller module may reattempt the particular step by reissuing, toones of the set of components, one or more instructions corresponding tothe particular step. In some cases, the error may prevent the set ofcommands from being completed. Accordingly, orchestrator controllermodule may respond to the error by reversing the one or more statechanges in the target computer environment to return the target computerenvironment to the initial state. In some embodiments, reversing the oneor more state changes includes traversing backwards through an order inwhich ones of the set of commands have been completed. While performingthe traversing, the orchestrator controller module may undo the one ormore state changes caused by those commands that have been completed.

In some cases, the error includes the orchestrator controller modulecrashing while implementing the set of commands. The state informationmay allow for a reinstated orchestrator controller module tosubsequently resume implementation of the set of commands. In someembodiments, the state information is maintained by the orchestratorcontroller module at a location (e.g., database 130) that is external tothe orchestrator controller module such that the orchestrator controllermodule crashing does not cause the state information to be lost. Thestate information may be maintained by the orchestrator controllermodule at an operational entity within the target computer environment.

Turning now to FIG. 16, a flow diagram of a method 1600 is shown. Method 1600 is one embodiment of a method performed in order to implement a workflow on a target computer environment (e.g., target environment 137). Method 1600 may be performed by executing a set of program instructions stored on a non-transitory computer-readable medium. In some embodiments, method 1600 may include additional steps. As an example, the orchestrator controller module may receive a request (e.g., workflow request 1405) to perform the operational scenario. In some cases, the request may specify a name value corresponding to the workflow that permits the workflow information to be accessed.

Method 1600 begins in step 1610 with a computer system executing ahierarchy of components having controller modules and operationalentities. In various cases, the hierarchy may include an orchestratorcontroller module at a top level of the hierarchy that is executable toimplement an operational scenario by carrying out a set of commands thatcorrespond to a sequence of steps of the operational scenario.

In step 1620, in response to receiving a request to implement a particular operational scenario for a target computer environment having an initial state and a set of components of the hierarchy, the orchestrator controller module implements a workflow having commands corresponding to the particular operational scenario.

In step 1622, as part of implementing the workflow, the orchestratorcontroller module issues instructions to ones of the controller modulesin the hierarchy to cause the commands of the workflow to be carried outsuch that one or more state changes are made to the target computerenvironment relative to the initial state. In some cases, issuing theinstructions may cause a particular controller module within thehierarchy to implement a second, different workflow. The workflow andthe second workflow may form a hierarchy of workflows that includes theworkflow at a top level of the hierarchy of workflows and the secondworkflow at a next level of the hierarchy of workflows. In some cases,the particular controller module may implement the second workflow byissuing instructions to components in a next level of the hierarchy ofcomponents relative to a level that includes the particular controllermodule.

In step 1624, as part of implementing the workflow, the orchestratorcontroller module maintains state information (e.g., workflow stateinformation 1430) identifying a current state of the target computerenvironment that permits a response to an error in implementing theworkflow. In some cases, the error may include the orchestratorcontroller module hanging while implementing the workflow. Accordingly,the state information may permit a reinstated orchestrator controllermodule to subsequently resume implementation of the workflow. In someembodiments, the state information specifies configuration variables(e.g., variables 340) for the set of components included in the targetcomputer environment.

Turning now to FIG. 17, a block diagram of a reasoning engine 450 is depicted. In the illustrated embodiment, reasoning engine 450 includes a direct reasoning engine 1710 and an indirect reasoning engine 1720. As illustrated, reasoning engine 450 can provide a workflow 1410 to workflow engine 440 for implementation.

In some embodiments, reasoning engine 450 may be implemented differentlythan shown. For example, reasoning engine 450 may operate withoutworkflow engine 440. That is, reasoning engine 450 may generate andimplement steps to move system 100 to an intended state. This mayinvolve the reasoning engine 450 assessing the state of system 100,issuing a command that control API 700 supports, and then reassessingthe state of system 100 until the intended state is reached. Forexample, reasoning engine 450 may receive a reasoning request 1705 totransition an application version from “A” to “X”. Accordingly,reasoning engine 450 may issue a transition command to transition theapplication version from “A” to “X”. But if, for example, a databaseserver associated with the transition command crashes, then reasoningengine 450 may identify this new state of system 100. Accordingly,reasoning engine 450 may generate and issue a new command to transitionthe database server's status from “offline” to “online.” Reasoningengine 450 may then reattempt transitioning the application version from“A” to “X”. In this manner, reasoning engine 450 may implement steps inan order much like workflow engine 440, but the steps can be generatedon the fly or in bulk up front.
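The assess, issue, and reassess cycle described above can be sketched in Python as follows. This is an illustration only: the function names, the command names start_db_server and transition_version, and the dictionary model of system state are all assumptions made for the example, and the rule that the database server must be online before the version transition is hard-coded rather than derived from blueprints.

```python
from typing import Callable, Dict, Optional

State = Dict[str, str]


def next_command(current: State, goal: State) -> Optional[str]:
    """Pick the next command for the observed state, or None when at the goal."""
    # Assumed prerequisite: the database server must be online first.
    if current["db_server"] != "online":
        return "start_db_server"
    if current["app_version"] != goal["app_version"]:
        return "transition_version"
    return None


def reason_until(
    goal: State,
    assess: Callable[[], State],
    issue: Callable[[str, State], None],
    max_steps: int = 10,
) -> bool:
    """Assess the state, issue one command, and reassess until the goal holds."""
    for _ in range(max_steps):
        command = next_command(assess(), goal)
        if command is None:
            return True
        issue(command, goal)
    # Steps are exhausted; report whether the last command reached the goal.
    return next_command(assess(), goal) is None
```

Because the state is reassessed on every iteration, a database server that crashes during the version transition would simply be observed as “offline” on the next pass and restarted before the transition is reattempted. The max_steps bound is an added safeguard so that a command that never succeeds cannot cause the loop to run indefinitely.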

Direct reasoning engine 1710, in various embodiments, is a set of software routines executable to generate a workflow 1410. As shown, reasoning engine 450 can receive a reasoning request 1705. Instead of specifying a workflow 1410 to be implemented, reasoning request 1705 may specify a high-level goal (e.g., a desired state of the system under management) or a command, such as a transition version command, to be achieved. For example, reasoning request 1705 might specify that a database service entity 110 should be instantiated that includes one or more database server entities 110 and one or more metric server entities 110. That reasoning request, however, may not specify commands for instantiating the database service entity 110. Accordingly, direct reasoning engine 1710 may apply direct reasoning concepts in order to generate a workflow 1410 that can be implemented to achieve the high-level goal. In various cases, direct reasoning engine 1710 may use information, such as relationship information 330 included in blueprints 210, to identify how operational entities 110 are related. Based on how operational entities 110 are related, direct reasoning engine 1710 may determine that particular operational entities 110 should be instantiated before other operational entities 110. Based on this reasoning, direct reasoning engine 1710 may generate an ordered set of commands.

Continuing with the previous example, the reasoning request 1705 mayspecify a UUT 321 or a UUID 325 for the database service entity 110.Direct reasoning engine 1710 may use that information to access ablueprint 210 for that database service entity 110. That blueprint mayindicate that the database service entity 110 comprises a databaseserver entity 110 and a metric server entity 110. Based on the databaseservice entity's blueprint 210, direct reasoning engine 1710 may accessa blueprint 210 for the database server entity 110 and a blueprint 210for the metric server entity 110. Those blueprints 210 may indicate arelationship 331 between the database server entity 110 and the metricserver entity 110. In some cases, the relationship 331 might indicatethat the metric server entity 110 depends on the existence of thedatabase server entity in order for the metric server entity 110 tooperate correctly. Accordingly, direct reasoning engine 1710 maydetermine, based on the relationship, that the database server entity110 needs to be instantiated before the metric server entity 110. Basedon that determination, direct reasoning engine 1710 may generate a setof commands that includes a command for instantiating the databaseserver entity 110, where the set of commands are ordered such that thatcommand comes before another command for instantiating the metric serverentity 110.
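One way to derive such an ordering, offered here only as a sketch, is a topological sort over the dependence relationships found in the blueprints. In the Python example below, the dictionary layout with “comprises” and “depends_on” keys and the function name generate_commands are assumptions made for illustration and do not reflect the actual format of blueprints 210; TopologicalSorter is part of the Python standard library (version 3.9 and later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Assumed, simplified stand-in for blueprints and their relationships.
blueprints: Dict[str, Dict[str, List[str]]] = {
    "database_service": {"comprises": ["database_server", "metric_server"]},
    "database_server": {"depends_on": []},
    "metric_server": {"depends_on": ["database_server"]},
}


def generate_commands(service: str, blueprints: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order instantiate commands so that dependencies are created first."""
    members = blueprints[service]["comprises"]
    # Map each entity to the set of entities that must exist before it.
    graph = {m: set(blueprints[m].get("depends_on", [])) for m in members}
    ordered = TopologicalSorter(graph).static_order()
    return [f"instantiate {entity}" for entity in ordered]


commands = generate_commands("database_service", blueprints)
# commands == ["instantiate database_server", "instantiate metric_server"]
```

A cyclic set of dependencies would cause TopologicalSorter to raise graphlib.CycleError, which a reasoning component could surface as an invalid set of blueprints rather than emitting an unorderable workflow.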

Indirect reasoning engine 1720, in various embodiments, is a set of software routines executable to generate a workflow 1410. In contrast to direct reasoning engine 1710, indirect reasoning engine 1720 may apply indirect reasoning concepts in order to generate a workflow 1410. For example, a database table might be subject to many expensive scans, and a possible solution might be to create an index. Accordingly, indirect reasoning engine 1720 may determine that an index should be created for that database table (e.g., by analyzing information that indicates that an index has been beneficial for other database tables that had expensive scans). Indirect reasoning engine 1720 may generate a workflow 1410 having a set of commands to create the index for that database table. After a workflow 1410 has been generated by reasoning engine 450, the workflow 1410 may be provided to workflow engine 440 for implementation. In some cases, a workflow 1410 may be stored (e.g., at database 130) so that the workflow 1410 can be retrieved to implement the high-level goal again without having to be regenerated.

Turning now to FIG. 18, a flow diagram of a method 1800 is shown. Method1800 is one embodiment of a method performed by an orchestratorcontroller module (e.g., a controller module 120) in order to generateand implement a workflow (e.g., a workflow 1410) on a target computerenvironment (e.g., target environment 137). Method 1800 may be performedby executing a set of program instructions stored on a non-transitorycomputer-readable medium. In some embodiments, method 1800 may includeadditional steps. As an example, the orchestrator controller module maystore a generated workflow in a database to permit the operationalscenario to be re-implemented without having to regenerate the workflow.

Method 1800 begins in step 1810 with the orchestrator controller modulereceiving a request (e.g., reasoning request 1705) to implement anoperational scenario to transition a target computer environment from afirst state to a second, different state. The target computerenvironment may have a set of components that include controller modulesand operational entities. In various cases, the received request may notspecify commands for transitioning the target computer environment fromthe first state to the second state. The request may identify first andsecond components to be instantiated in the target computer environmentas part of the operational scenario. The operational scenario mayinclude starting up a database service having a set of database serverscapable of performing database transactions on behalf of users of thecomputer system.

In step 1820, the orchestrator controller module generates a workflowthat defines a particular set of commands to transition the targetcomputer environment from the first state to the second state, includingby changing states of ones of the set of components. In various cases,the orchestrator controller module may access blueprints (e.g.,blueprints 210) that correspond to the set of components, the firstcomponent, and the second component. The blueprints may definerelationships between components that affect an order in which theparticular set of commands are implemented. For example, therelationships may include a dependence relationship in which the firstcomponent depends on the existence of the second component in order forthe first component to operate in a valid manner. As such, theparticular set of commands may include a first command to instantiatethe first component and a second command to instantiate the secondcomponent. The particular set of commands may be ordered based on thedependence relationship such that the second command precedes the firstcommand in implementation.

In step 1830, the orchestrator controller module implements theparticular set of commands by issuing instructions to one or morecontroller modules in the set of components to transition the targetcomputer environment to the second state. In some cases, the particularset of commands may be defined such that the particular set of commandscan be implemented in a forward order to transition the target computerenvironment from the first state to the second state and implemented ina backwards order to transition the target computer environment to thefirst state. In response to detecting an error in implementing theparticular set of commands, the orchestrator controller module maytransition the target computer environment from a current state back tothe first state according to the backwards order. In some cases,changing the states of ones of the set of components may includeupdating an operational entity from a first version to a second version.

Turning now to FIG. 19, a flow diagram of a method 1900 is shown. Method1900 is one embodiment of a method performed in order to generate andimplement a workflow (e.g., a workflow 1410) on a target computerenvironment (e.g., target environment 137). Method 1900 may be performedby executing a set of program instructions stored on a non-transitorycomputer-readable medium. In some embodiments, method 1900 may includeadditional steps. For example, the orchestrator controller module maystore a generated workflow in a database to permit the operationalscenario to be re-implemented without having to regenerate the workflow.

Method 1900 begins in step 1910 with a computer system executing ahierarchy of components having controller modules and operationalentities. The hierarchy may include an orchestrator controller module ata top level of the hierarchy that is executable to implement anoperational scenario by carrying out a set of commands that correspondto a sequence of steps of the operational scenario.

In step 1920, in response to receiving a request (e.g., reasoningrequest 1705) to implement a particular operational scenario totransition a target computer environment from an initial state to an endstate, the orchestrator controller module generates a workflow thatdefines a particular set of commands to transition the target computerenvironment from the initial state to the end state. In various cases,the request may not identify the particular set of commands. Theparticular operational scenario may involve creating an operationalentity. As such, generating the workflow may include accessing ablueprint (e.g., a blueprint 210) for the operational entity. In somecases, the blueprint may identify a second operational entity that is tobe created in addition to the operational entity. Consequently, theorchestrator controller module may determine, based on a relationshipbetween the operational entity and the second operational entity, anorder in which to create the operational entity and the secondoperational entity. The particular set of commands may be generatedbased on the determined order.

In step 1930, the orchestrator controller module implements theparticular set of commands by issuing instructions to one or morecontroller modules in the hierarchy of components to transition thetarget computer environment to the end state, including by changingstates of ones of the components of the hierarchy.

Turning now to FIG. 20A, a block diagram of authorization service 140 isdepicted. As mentioned above, it can be important to ensure that anactor is not able to issue unauthorized instructions to the system 100to achieve some undesired ends. To protect system 100, authorizationservice 140 may be employed to audit the actions being requested by anactor interfacing with the system 100. In the illustrated embodiment,authorization service 140 includes authorization engine 2010,authorization sheets 2020, and test engine 150 and may interface withdatabase 130 including audit reports 2030. As shown in FIG. 20A, in someembodiments, authorization service 140 is a separate authorizationcomponent from controller modules 120 to audit commands issued tocontroller modules 120 and/or operational entities 110. In otherembodiments, authorization service 140 may be implemented differentlythan shown. For example, as will be discussed below with FIG. 20B,authorization service 140 may be integrated into controller modules 120(or orchestrator controller module 120), authorization service 140 maynot include test engine 150, etc.

Authorization engine 2010, in various embodiments, is a set of instructions executable to perform the auditing of issued commands. For example, a particular user may have issued a command to an orchestrator controller module 120 to transition a specific database cluster from online to offline. Before the orchestrator controller module 120 begins implementing this command, authorization engine 2010 may confirm whether the particular user (or a controller module 120 if it had issued the command) is authorized to issue such a command. In the illustrated embodiment, authorization service 140 receives indications of what commands have been issued via authorization requests 2005. Accordingly, when a controller module 120 receives a command from an entity (e.g., a user, a higher-level controller module 120, etc.) to perform one or more actions, the controller module 120 may send a request 2005 to authorize performance of a received command and identify various information about the command such as the actions to be performed, the issuer of the command, or other contextual information about the command. In the illustrated embodiment, authorization engine 2010 evaluates the information included in authorization requests 2005 against a set of security rules defined in authorization sheets 2020 in order to verify that the issued commands comply with the permissible actions defined by the set of security rules. One example of authorization sheets 2020 is discussed in greater detail below with respect to FIG. 21.

As part of the audit process, in various embodiments, authorization engine 2010 (or other components of service 140 or system 100) may authenticate various entities associated with authorization requests 2005, which may occur before (or after) receiving requests 2005. In some embodiments, this includes authenticating the initial issuer (e.g., a user or a controller module 120) of the command. Accordingly, if a user is the issuer, an authentication prompt asking for a username and password may be presented to the user to confirm his or her identity. In some embodiments, this includes authenticating the controller module 120 making an authorization request 2005. In some embodiments, to implement this authentication, operational entities 110, controller modules 120, and/or authorization service 140 may be provisioned with certificates for public-key pairs maintained by components 110, 120, and 140. Components 110, 120, and 140 may then exchange these certificates to mutually authenticate one another and establish secure communication links. For example, a controller module 120 and authorization service 140 may exchange certificates in an Elliptic-curve Diffie-Hellman (ECDH) exchange to mutually authenticate and establish a shared secret for a Transport Layer Security (TLS) session through which authorization requests 2005 and authorization responses 2015 may be securely exchanged.

In various embodiments, the auditing performed by authorization engine 2010 also includes maintaining a log of audit reports 2030 in database 130. Accordingly, when an authorization request 2005 is received, authorization engine 2010 may record various information about the request 2005 in an audit report 2030. This information may include who issued the command, such as a user's or component's name (e.g., an identifier value of the orchestrator controller module 120), as well as who is the command's target, such as a name or UUID of a controller module 120 or operational entity 110. This information may include what action or actions are being instructed by the command. This information may include when the command was issued. This information may include an indication of the command origin such as an IP address, UDP or TCP port number, etc. Authorization engine 2010 may also record information about the corresponding authorization response 2015, such as whether a given request 2005 was granted or denied and, in the event of a denial, the reasons for it. In some embodiments, database 130 may restrict authorization service 140's access to audit reports 2030 such that service 140 is permitted to write audit reports 2030 but not to delete any reports 2030. Thus, audit reports 2030 may be preserved even if authorization service 140 becomes compromised or an authorized manager of service 140 attempts to abuse his or her access privileges.

Based on its evaluation of authorization sheets 2020 for a given request 2005, authorization engine 2010 may issue a corresponding response 2015 indicating whether a received command is authorized or not. As will be discussed below with respect to FIG. 22, in some embodiments, authorization engine 2010 may include a signed token in its authorization response 2015 that is usable by subsequent components (e.g., controller modules 120 and/or operational entities 110) to confirm that performance of one or more actions identified in an issued command has been authorized by service 140. In doing so, an initial controller module 120 (e.g., orchestrator controller module 120) may handle interaction with authorization service 140 to obtain a token and then can pass the token on to the one or more other components performing the actions. Those components can verify that approval for the actions has been granted without having to contact authorization service 140 again for the same approved actions.
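The general idea of a token that downstream components can check without contacting the issuing service again can be sketched as follows. This Python example is an assumption-laden illustration, not the token format of authorization service 140: it uses an HMAC-SHA256 tag over a JSON payload with a shared key, whereas a deployment might instead use an asymmetric signature so that verifying components never hold the signing key. The function names issue_token and verify_token and the payload fields are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import List


def issue_token(key: bytes, issuer: str, actions: List[str]) -> str:
    """Return 'payload.tag', both base64url encoded (assumed layout)."""
    payload = json.dumps(
        {"issuer": issuer, "actions": sorted(actions)}, sort_keys=True
    ).encode()
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(tag).decode()
    )


def verify_token(key: bytes, token: str, action: str) -> bool:
    """Check the tag locally, then check that the action was approved."""
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False  # payload was altered or signed with another key
    return action in json.loads(payload)["actions"]
```

In this sketch, an orchestrator would obtain the token once and forward it with each instruction; a lower-level component would call verify_token before acting. A practical token would likely also carry an expiry time and the intended target, which are omitted here for brevity.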

Turning now to FIG. 20B, another block diagram of authorization service 140 is depicted. In the illustrated embodiment depicted in FIG. 20A discussed above, components of authorization service 140 are distinct from controller modules 120 and/or operational entities 110. In some embodiments, however, components of authorization service 140 may be interspersed among components 120 and/or 110. For example, as shown in FIG. 20B, an instance of authorization engine 2010 may be included in a controller module 120 in order to verify that commands received by the controller module 120 comply with the permissible actions defined by rules within authorization sheets 2020. In the illustrated embodiment, authorization service 140 also includes a distributer 2040 that maintains a master copy of authorization sheets 2020A and distributes copies of authorization sheets 2020B to each instance of authorization engine 2010 to facilitate its locally performed evaluations. In some embodiments, a local copy of authorization sheets 2020B may contain only the rules applicable to that engine 2010 rather than a full copy of each rule contained in authorization sheets 2020A. In various embodiments, distributer 2040 also signs each copy of authorization sheets 2020B to preserve its integrity. In some embodiments, authorization service 140 may be implemented differently than shown. For example, each instance of authorization engine 2010 may be responsible for maintaining its own sheets 2020 rather than receiving a copy of sheets 2020 from a centralized entity, instances of authorization engine 2010 may also be located within operational entities 110, etc.

Turning now to FIG. 21, a block diagram of an authorization sheet 2020is depicted. As shown, authorization sheets 2020 may include a list ofrules 2100, which, as mentioned above, may be evaluated by authorizationengine 2010 when determining whether to grant an authorization request2005. In the illustrated embodiment, rules 2100 include permissions2102, subjects 2104, actions 2106, and other parameters 2108. In otherembodiments, rules 2100 may include other suitable criteria forevaluating issued commands.

Permissions 2102, in various embodiments, define whether a given rule 2100 grants rights or restricts rights. For example, the permission of rule 2100A indicates that it is restrictive with respect to the subject 2104 John while the permission of rule 2100B indicates that it is permissive with respect to the subject 2104 Jan.

Subjects 2104, in various embodiments, identify the issuer/requester (i.e., the one issuing the command being evaluated) with respect to a given rule 2100. For example, the subjects 2104 for the rules 2100B and 2100C indicate that both rules 2100 pertain to the requester Jan. Accordingly, when a request 2005 is received for a given command to be issued to one of the components in the hierarchy, authorization engine 2010 may verify whether a requester of the command corresponds to an authorized requester identified by subjects 2104. Although the examples depicted in FIG. 21 are user names, other forms of identification may be used such as IP addresses, UUIDs, etc.

Actions 2106, in various embodiments, identify actions acceptable or unacceptable with respect to a given rule 2100. For example, actions for rules 2100B and 2100C indicate that subject Jan is allowed to request the actions “create” and “transition.” In the case of rule 2100A, an asterisk is used to reject all actions with respect to subject 2104 John. Accordingly, when a request 2005 is received for a given command, authorization engine 2010 may, in addition to verifying elements 2102 and 2104, also verify whether an action to be performed by the command is one of the authorized actions 2106.

Parameters 2108, in various embodiments, include various additional criteria associated with a given action 2106. For example, rule 2100B specifies the parameters of “DB” and “instance” to indicate that the action “create” for “Jan” is restricted to instances of databases. Other examples of parameters 2108 may include time restrictions (e.g., when an action can (or cannot) be requested), target restrictions (e.g., identifying a particular UUID for a target where an action may (or may not) be performed), IP address restrictions, etc.
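
The rule elements described above (permission, subject, action, and parameters) can be illustrated with a minimal sketch. The `Rule` class, the `is_authorized` function, the "allow"/"deny" labels, and the policy that a matching restrictive rule overrides permissive rules are all assumptions made for illustration; the disclosure does not prescribe a particular data layout or conflict-resolution policy, and the empty parameter list shown for rule 2100C is likewise assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory form of one rule 2100."""
    permission: str          # "allow" or "deny" (permission 2102)
    subject: str             # requester identity (subject 2104)
    action: str              # action name, or "*" for every action (action 2106)
    parameters: tuple = ()   # additional criteria (parameters 2108)


def is_authorized(rules, subject, action, parameters=()):
    """Return True only if some allow rule matches and no deny rule matches."""
    allowed = False
    for rule in rules:
        if rule.subject != subject:
            continue
        if rule.action not in ("*", action):
            continue
        # A rule carrying parameters applies only to requests that include them all.
        if not set(rule.parameters) <= set(parameters):
            continue
        if rule.permission == "deny":
            return False     # assumed policy: a restrictive rule wins
        allowed = True
    return allowed


# Rules loosely modeled on the FIG. 21 examples (2100A, 2100B, 2100C).
RULES = [
    Rule("deny", "John", "*"),
    Rule("allow", "Jan", "create", ("DB", "instance")),
    Rule("allow", "Jan", "transition"),
]
```

Under these assumptions, a request from Jan to create a database instance is granted, any request from John is rejected, and a requester with no matching rule is rejected by default.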

In various embodiments, authorization service 140 provides a user interface, which may be a command line interface or graphical user interface, to allow a security team to set various ones of rules 2100. In some embodiments, the security team is distinct from the potential users administrating the system. The rules 2100 may also be signed, downloaded, and validated periodically to ensure that they have not been tampered with.

Turning now to FIG. 22, a block diagram of an exchange using a token 2200 is depicted. As mentioned above, in some embodiments, authorization service 140 may issue a token 2200 that is usable by controller modules 120 and/or operational entities 110 to confirm that a set of actions associated with a received command has already been authorized by authorization service 140. For example, a first controller module 120A might receive a first issued command 2210A to create multiple instances of a database and, to implement this command 2210A, intend to issue a second command 2210B to each controller module 120 (or operational entity 110) handling creation of a respective one of the database instances. In the illustrated embodiment, controller module 120A can issue an authorization request 2005 corresponding to the received first command 2210A. In response to approving the request 2005, authorization service 140 may send back an authorization response 2015 that includes a token 2200 indicating that the various actions needed to create the database instances have been authorized. Controller module 120A can then include the token 2200 in the second set of commands 2210B issued to subsequent controller modules 120 and/or operational entities 110, which can determine, from token 2200, what actions have already been authorized and begin performing them without having to recontact authorization service 140 for permission to perform those actions.

Token 2200 may include any suitable content for facilitating confirmation that performance of commands 2210 has been authorized by authorization service 140. In the illustrated embodiment, token 2200 includes access rights 2202, timestamp 2204, and signature 2206. In other embodiments, token 2200 may include more (or fewer) components than shown. In some embodiments, token 2200 may be implemented as a JSON Web Token (JWT), Kerberos token, X.509 certificate, or some other standard format for a signed attestation.

Access rights 2202, in various embodiments, indicate a set of particular actions that have been approved for performance by authorization service 140 and may, in general, include various elements from rules 2100 discussed above. Accordingly, a given right 2202 may not only identify a given action 2106 but also indicate a particular subject 2104 permitted to issue a command for that action 2106. Thus, in response to receiving a command 2210B from controller module 120A to perform a particular action, a controller module 120B may verify that controller module 120A is identified in access rights 2202 as being permitted to request the particular action. In some embodiments, access rights 2202 may also identify the targets authorized to perform particular actions, which may be identified using an IP address, UUID, etc. Accordingly, a controller module 120B receiving a token 2200 may confirm that it is identified in token 2200 as an authorized target to perform a particular action identified in command 2210B.

Timestamp 2204 and signature 2206, in various embodiments, are included to facilitate verification of a token 2200 by subsequent recipients such as controller modules 120 or operational entities 110. In general, timestamp 2204 may express some restriction on how long an issued token 2200 is valid. Accordingly, in one embodiment, timestamp 2204 may be a time value indicating when a token 2200 was issued, and components 120 and 110 may be operable to accept a token 2200 only within some window after timestamp 2204. In another embodiment, timestamp 2204 may be a start time and a stop time indicating a window in which actions authorized by access rights 2202 must be performed. In yet another embodiment, timestamp 2204 may indicate an expiration time value after which token 2200 is no longer valid. Signature 2206 may generally be used to ensure that the integrity of token 2200 is preserved or, said differently, that token 2200 has not been tampered with (and is not a counterfeit). Accordingly, in some embodiments, signature 2206 is generated from the contents of token 2200 using a private key maintained by authorization service 140 and having a corresponding trusted public key known to components 120 and 110. In response to receiving a token 2200, a component 110 or 120 may use the public key to verify the signature 2206 against the contents of token 2200 before performing any actions identified in issued command 2210B.
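
The checks a recipient might apply to a token 2200 (signature, validity window, and access rights) can be sketched as follows. This is only an illustration under stated assumptions: it substitutes an HMAC over a shared secret for the private/public key signature described above, because the Python standard library offers no asymmetric signing, and the names `SECRET`, `issue_token`, `verify_token`, the dictionary layout, and the 300-second window are all hypothetical.

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-key"  # hypothetical shared key standing in for the service's key pair


def _payload(access_rights, timestamp):
    """Serialize the signed portion of the token deterministically."""
    body = {"access_rights": access_rights, "timestamp": timestamp}
    return json.dumps(body, sort_keys=True).encode()


def issue_token(access_rights, now=None):
    """Build a token with access rights, an issue timestamp, and a signature."""
    timestamp = time.time() if now is None else now
    signature = hmac.new(SECRET, _payload(access_rights, timestamp),
                         hashlib.sha256).hexdigest()
    return {"access_rights": access_rights, "timestamp": timestamp,
            "signature": signature}


def verify_token(token, requester, target, action, now=None, max_age=300.0):
    """Accept the token only if it is intact, fresh, and covers this request."""
    expected = hmac.new(SECRET,
                        _payload(token["access_rights"], token["timestamp"]),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token.get("signature", "")):
        return False                       # contents were altered or forged
    now = time.time() if now is None else now
    if not 0 <= now - token["timestamp"] <= max_age:
        return False                       # outside the assumed validity window
    return any(
        right["subject"] == requester
        and right["target"] == target
        and right["action"] == action
        for right in token["access_rights"]
    )


TOKEN = issue_token(
    [{"subject": "controller-120A", "target": "controller-120B", "action": "create"}],
    now=1000.0,
)
```

In this sketch the recipient (here labeled "controller-120B") confirms both that the sender is a permitted subject and that it is itself a listed target before acting, mirroring the access-rights checks described above.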

Turning now to FIG. 23, a flow diagram of a method 2300 is shown. Method 2300 is one embodiment of a method performed by a computer system having an authorization service associated with a target computing environment such as authorization service 140. In various embodiments, performance of method 2300 may improve the security of the target computing environment.

Method 2300 begins in step 2310 with the computer system storing a set of security rules (e.g., rules 2100 included in authorization sheets 2020) defining permissible actions within a hierarchy of components (e.g., operational entities 110, controller modules 120, etc.) for implementing an operational scenario within a target computing environment. In step 2320, the computer system implements the operational scenario within the target computing environment including issuing a set of commands to components within the hierarchy and verifying that the set of commands complies with the permissible actions defined by the set of security rules.

In various embodiments, issuing the set of commands includes a first component of the hierarchy sending, to an authorization service (e.g., authorization service 140) performing the verifying, an authorization request (e.g., authorization request 2005) for a particular issued command and, in response to the authorization service determining that the particular command complies with the permissible actions defined by the set of security rules, the first component receiving a response (e.g., authorization response 2015) authorizing performance of the command. Based on the authorizing response, the first component performs one or more actions identified in the issued command. In some embodiments, the verifying includes the authorization service authenticating a source of the authorization request prior to sending the authorizing response. In some embodiments, the verifying includes the authorization service storing, in a log, a report (e.g., audit reports 2030 in database 130) identifying reception of the authorization request. In some embodiments, the received authorizing response includes a token (e.g., token 2200) indicating that a second component in the hierarchy is authorized to perform a particular action, and performing the one or more actions includes the first component issuing, to the second component, a command (e.g., second issued command 2210B) including the token, the token being verifiable by the second component to confirm that performance of the particular action has been authorized. In some embodiments, the token identifies the particular action (e.g., in access rights 2202), the first component, and a signature (e.g., signature 2206) of the authorization service.

In various embodiments, the set of rules includes a rule identifying an authorized requester (e.g., subject 2104) and one or more authorized actions (e.g., actions 2106) associated with the authorized requester, and the verifying includes receiving an indication of a command to be issued to one of the components in the hierarchy, verifying whether a requester of the command corresponds to the authorized requester, and verifying whether an action to be performed by the command is one of the authorized actions.

In various embodiments, method 2300 further includes a first component in the hierarchy receiving a request to issue a command to a second component in the hierarchy, and the verifying includes the first component verifying (e.g., using authorization sheets 2020B) that the requested command complies with the permissible actions defined by the set of security rules prior to the first component issuing the command to the second component. In some embodiments, the second component is an operational entity (e.g., operational entity 110) operable to perform the issued command. In some embodiments, the second component is a controller module (e.g., a controller module 120) operable to cause one or more operational entities to perform the issued command.

Turning now to FIG. 24, a block diagram of testing engine 150 is depicted. As mentioned above, performing adequate testing can be important for ensuring that a system operates reliably. In many instances, however, it may be difficult to test every possible state that a system may experience during its lifetime, particularly when such testing is performed manually. As will be discussed below, in various embodiments, test engine 150 is employed to automate testing of system 100 through injection of various fault conditions in order to identify states in which system 100 fails to function properly. In the illustrated embodiment, test engine 150 includes a scan engine 2410, pre-scan graph 2420, post-scan graph 2430, and perturb engine 2440. In other embodiments, test engine 150 may be implemented differently than shown.

Scan engine 2410, in various embodiments, handles collection of information about controller modules 120 and operational entities 110 in order to facilitate operation of test engine 150. In some embodiments, this collection begins with performance of a discovery operation in which scan engine 2410 attempts to learn about the various controller modules 120 and operational entities 110 within system 100. Accordingly, scan engine 2410 may initially send a request 2412 asking orchestrator controller module 120 to describe itself and identify other controller modules 120 that directly (or indirectly in some embodiments) interact with orchestrator controller module 120. In some embodiments, orchestrator controller module 120 may send a response 2414 including a graph data structure identifying the controller modules 120 and operational entities 110 of system 100 as well as describing their arrangement. Based on this received information, scan engine 2410 may then send description requests 2412 to the newly discovered controller modules 120 and operational entities 110. These components may then send corresponding description responses 2414, which may include any of various suitable information. For example, a given controller module 120 or operational entity 110 may include a general description of itself, which may include identifying its role in system 100 and including information such as its universally unique identifier (UUID), vendor, version, relationships to other controller modules 120 and/or operational entities 110, attributes, configuration variables, etc. In some embodiments, a given controller module 120 may also identify in a response 2414 various application programming interface (API) functionality supported by it. For example, a controller module 120 may support API calls from scan engine 2410 to retrieve information about a controlled operational entity 110 such as fetching configuration information, logs, metrics, facts, etc.
In various embodiments, controller modules 120 may also identify in their responses 2414 what injectable fault conditions are supported and can be requested by test engine 150. For example, a controller module 120 that controls multiple database operational entities 110 may advertise that it supports killing a database instance (or killing a container including a database instance), halting execution of a database instance, starving a database instance, etc.
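
The discovery operation described above can be sketched as a breadth-first walk that starts at the orchestrator and follows each description response to the components it names. The `describe` function, the `DESCRIPTIONS` table, the component names, and the "children"/"faults" fields are hypothetical stand-ins for requests 2412 and responses 2414; the disclosure does not define their wire format.

```python
from collections import deque

# Hypothetical canned description responses 2414, keyed by component name.
DESCRIPTIONS = {
    "orchestrator": {"children": ["db-controller", "app-controller"], "faults": []},
    "db-controller": {"children": ["db-1", "db-2"],
                      "faults": ["kill", "halt", "starve"]},
    "app-controller": {"children": ["app-1"], "faults": ["kill"]},
    "db-1": {"children": [], "faults": []},
    "db-2": {"children": [], "faults": []},
    "app-1": {"children": [], "faults": []},
}


def describe(component):
    """Stand-in for sending a description request 2412 to one component."""
    return DESCRIPTIONS[component]


def discover(root):
    """Walk the hierarchy from the root, recording advertised fault conditions."""
    advertised = {}
    pending = deque([root])
    while pending:
        component = pending.popleft()
        if component in advertised:
            continue                      # already described; avoid revisiting
        response = describe(component)
        advertised[component] = list(response["faults"])
        pending.extend(response["children"])
    return advertised


INVENTORY = discover("orchestrator")
```

The resulting inventory maps every discovered component to the injectable fault conditions it advertised, which is the information a test engine would need before choosing what to inject.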

In various embodiments, scan engine 2410 also collects various state information about the state of system 100 before injection of a fault condition and the state of system 100 after injection of a fault condition. Accordingly, scan engine 2410 may collect this information through the issuance of requests 2412 and reception of responses 2414 as discussed above. In some embodiments, controller modules 120 may also provide real-time telemetry data to scan engine 2410. For example, a controller module 120 maintaining database instances may indicate how many database instances are currently in operation and notify scan engine 2410 when that number changes. As noted above, in some embodiments, scan engine 2410 may receive information through test engine 150's integration into authorization service 140. For example, if a controller module 120 has been issued a command to provision another database instance, scan engine 2410 may learn of this issued command when the controller module 120 sends an authorization request 2005 to authorization service 140 to ask permission to implement the command. In the illustrated embodiment, metadata collected about the state before the fault-condition injection may be assembled into pre-scan graph 2420, and metadata collected about the state after the fault-condition injection may be assembled into post-scan graph 2430. As will be described below, scan engine 2410 (or some other component of test engine 150 in other embodiments) may compare these graphs 2420 and 2430 in order to glean insight into how an injected fault condition affects system 100.
In some embodiments, to facilitate organization of this metadata and subsequent comparison of graphs 2420 and 2430, scan engine 2410 assembles graphs 2420 and 2430 as respective graph data structures. Accordingly, each node in pre-scan graph 2420 may correspond to a respective controller module 120 or operational entity 110 within system 100 and may include various metadata collected about the state of that module 120 or entity 110 before injection. Edges between nodes may correspond to relationships that exist between controller modules 120 and operational entities 110. Each node in post-scan graph 2430 may be similarly organized and include metadata about a given controller module 120's or operational entity 110's state after injection of a fault condition. Scan engine 2410 may then determine how a given injected fault condition affected system 100 by identifying what nodes have been altered between pre-scan graph 2420 and post-scan graph 2430 and then examining the contents of altered nodes to determine specific details resultant from the injected fault condition.
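
The comparison of the two graphs can be sketched as a node-by-node and edge-by-edge difference. The representation below (a dictionary with a "nodes" mapping and an "edges" set), the `diff_graphs` function, and the sample metadata keys are assumptions chosen for illustration rather than the format used by scan engine 2410.

```python
def diff_graphs(pre, post):
    """Report nodes and edges that differ between a pre-scan and a post-scan graph.

    Each graph is assumed to be {"nodes": {node_id: metadata}, "edges": set of (src, dst)}.
    """
    pre_nodes, post_nodes = pre["nodes"], post["nodes"]
    altered = {}
    for node_id in sorted(pre_nodes.keys() & post_nodes.keys()):
        before, after = pre_nodes[node_id], post_nodes[node_id]
        # Keep only the metadata keys whose values changed, as (before, after) pairs.
        changed = {
            key: (before.get(key), after.get(key))
            for key in sorted(before.keys() | after.keys())
            if before.get(key) != after.get(key)
        }
        if changed:
            altered[node_id] = changed
    return {
        "removed_nodes": sorted(pre_nodes.keys() - post_nodes.keys()),
        "added_nodes": sorted(post_nodes.keys() - pre_nodes.keys()),
        "altered_nodes": altered,
        "removed_edges": sorted(pre["edges"] - post["edges"]),
        "added_edges": sorted(post["edges"] - pre["edges"]),
    }


# Hypothetical scans taken before and after killing database instance "db-2".
PRE_SCAN = {
    "nodes": {
        "db-controller": {"instances": 2},
        "db-1": {"status": "running"},
        "db-2": {"status": "running"},
    },
    "edges": {("db-controller", "db-1"), ("db-controller", "db-2")},
}
POST_SCAN = {
    "nodes": {
        "db-controller": {"instances": 1},
        "db-1": {"status": "running"},
    },
    "edges": {("db-controller", "db-1")},
}
DIFF = diff_graphs(PRE_SCAN, POST_SCAN)
```

An empty difference would suggest that the system absorbed the fault, whereas the altered and removed entries point to the components whose metadata merits closer examination.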

Perturb engine 2440, in various embodiments, is responsible for selecting fault conditions for injection and sending perturb instructions 2418 to the appropriate controller modules to cause their injection. These fault conditions may correspond to any suitable conditions that may cause system 100 to experience a fault. For example, in some embodiments, perturb engine 2440 may issue a perturb instruction 2418 to kill, suspend, halt, hang, or terminate an operational entity 110 to see its effect on system 100. In some embodiments, perturb engine 2440 may issue a perturb instruction 2418 to alter the resources available to an operational entity 110 to starve or overload the entity 110. For example, perturb engine 2440 may alter the processing resources available to an operational entity 110 causing the operational entity 110 to be assigned a lower execution priority, scheduled less frequently for execution, allocated fewer processors for execution, etc. Perturb engine 2440 may alter the memory resources available to an operational entity 110 by allocating it less volatile or non-volatile storage, swapping pages out of memory, etc. Perturb engine 2440 may alter the network resources available to an operational entity 110 by reducing the network bandwidth available for communications with the operational entity 110, increasing a latency for communications with the operational entity 110, dropping communications with the operational entity 110, disconnecting a network connection of the operational entity 110, etc. In various embodiments, perturb engine 2440 may inject fault conditions to interfere with the interdependencies of operational entities 110 within system 100. For example, an application server (a first operational entity 110) may rely on data stored in a database server (a second operational entity 110). To test the resiliency of the application server, perturb engine 2440 may corrupt the data in the database (or merely crash the database) to determine the effect on the application server.
As another example, two or more operational entities 110 may work together in lockstep to achieve some purpose, and perturb engine 2440 may attempt to halt operation of one of the entities 110 to determine whether a deadlock can be induced. As yet another example, an operational entity 110 may rely on configuration data stored in a configuration file, and perturb engine 2440 may alter (or even corrupt) that data to interfere with its operation. Perturb engine 2440 may also inject other real-world fault conditions such as causing power failures, disconnecting blade servers, causing network switch failures, etc. As noted above, these fault conditions may be injected on the actual system while it is running/live (as opposed to operating on some theoretical model of the system).

Perturb engine 2440 may employ any suitable selection algorithm for determining what fault conditions to inject. In some embodiments, perturb engine 2440 may randomly select fault conditions and issue corresponding instructions 2418 to have those fault conditions injected. In various embodiments, perturb engine 2440 may be instructed to target a particular aspect of system 100, such as a particular operational entity 110 or group of entities 110, and select a fault condition associated with that aspect. In various embodiments, perturb engine 2440 monitors the commands being issued to controller modules 120 and/or operational entities 110 and selects fault conditions for injection based on the issued commands. For example, test engine 150 may be instructed to target an update process being performed with respect to system 100. In response to a controller module 120 providing an indication 2416 that a particular command has been issued to it with respect to the update process, perturb engine 2440 may select a corresponding fault condition and issue the appropriate perturb instruction 2418 in order to attempt to interfere with the update process. The selected fault condition may, for example, include terminating execution of an operational entity 110 being updated during the update process such as crashing a container including a database instance that is undergoing an update. The selected fault condition may, as another example, include increasing a network latency for communications with an operational entity 110 being updated during the update process in an attempt to cause a failure associated with the update. In some embodiments, perturb engine 2440 may maintain history information identifying previously injected fault conditions and determine, for each of a set of fault conditions being considered for selection, a respective entropy score that indicates how different that fault condition is relative to what was previously injected as determined from the history information.
Perturb engine 2440 may then select the fault condition having the entropy score indicating that it is the most different (or, at least, sufficiently different) from what was previously selected. In some embodiments, perturb engine 2440 may maintain history information identifying previously injected fault conditions that produced faults in system 100 and may reselect those fault conditions after an attempt has been made to correct for those conditions in order to determine whether those corrections have been successful.
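
History-based selection of this kind can be sketched with a simple difference score. The disclosure does not specify how the entropy score is computed, so the metric below (the average fraction of mismatching fields between a candidate and each previously injected fault condition) is purely an assumed stand-in, as are the `(kind, target)` tuple representation and the function names.

```python
def difference_score(candidate, history):
    """Average fraction of fields in which the candidate differs from past faults."""
    if not history:
        return 1.0                        # nothing injected yet: maximally novel

    def distance(a, b):
        # Fraction of positions at which the two fault descriptions disagree.
        return sum(x != y for x, y in zip(a, b)) / len(a)

    return sum(distance(candidate, past) for past in history) / len(history)


def select_fault(candidates, history):
    """Pick the candidate that is most different from what was injected before."""
    return max(candidates, key=lambda c: difference_score(c, history))


# Hypothetical fault conditions expressed as (kind, target) pairs.
HISTORY = [("kill", "db-1"), ("kill", "db-2")]
CANDIDATES = [("kill", "db-1"), ("latency", "db-1"), ("latency", "app-1")]
CHOSEN = select_fault(CANDIDATES, HISTORY)
```

With this history, repeating the kill of "db-1" scores 0.25, adding latency to "db-1" scores 0.75, and adding latency to "app-1" scores 1.0, so the last candidate is selected. A production metric would likely weigh fields differently or use a richer description of each fault condition.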

As mentioned above, scan engine 2410 (or some other component) may compare metadata from pre-scan graph 2420 and post-scan graph 2430 in order to glean better insight about system 100. In some instances, this comparison may be performed to determine what may be affected by a particular injected fault condition. Such a determination may include scan engine 2410 identifying which operational entities 110 are directly affected by an injected fault condition, as well as identifying which operational entities 110 may indirectly be affected by an injected fault condition due to an unforeseen relationship between entities 110. For example, issuing a perturb instruction 2418 to crash one operational entity 110 might reveal that another operational entity 110 also crashes, and thus that some unperceived dependency may exist. Such a determination may also be used to establish that no operational entity 110 is affected (or at least not affected to the point of experiencing a fault) by an injected fault condition. For example, scan engine 2410 may determine that reducing a network connection's bandwidth by a particular amount does not result in a failure of the operational entity 110 using the network connection. In some instances, this comparison may be performed to determine system 100's resiliency to a fault condition. For example, a controller module 120 may be instructed to maintain a particular number of instances of an operational entity 110. Perturb engine 2440 may then issue a perturb instruction 2418 to kill one of the instances of the operational entity 110 in order to determine whether the controller module 120 instantiates another instance of the operational entity 110 in response to the killing.
In this example, a successful outcome may be that no difference is identified when pre-scan graph 2420 and post-scan graph 2430 are compared, meaning that system 100 was able to recover because the controller module 120 successfully instantiated a new instance of the operational entity 110 to replace the previously killed one.

In many instances, using test engine 150 in this manner may allow system 100 to be thoroughly tested in order to better understand operation of system 100. With the knowledge obtained from test engine 150, administrators may be able to better identify potential vulnerabilities and take corrective actions to address them. Administrators can also be more confident in knowing that a well-tested system can operate as designed when adverse conditions arise. Moreover, test engine 150 may automatically and thoroughly explore the fault behavior of any component (independent of type) deployed within system 100.

Turning now to FIG. 25, a flow diagram of a method 2500 is shown. Method 2500 is one embodiment of a method performed by a computer system testing a target computing environment such as a computer system including test engine 150. In many instances, performance of method 2500 may be usable to identify issues that, when corrected, improve the resiliency of the target computing environment.

Method 2500 begins in step 2510 with the computer system implementing an operational scenario within a target computing environment having a hierarchy of components including controller modules (e.g., controller modules 120) and operational entities (e.g., operational entities 110). In various embodiments, the implementing includes issuing a set of commands to components within the target computing environment. In step 2520, the computer system receives an indication (e.g., command indication 2416) that a particular one of the set of commands has been issued. In step 2530, in response to receiving the indication, the computer system instructs (e.g., via a perturb instruction 2418) one of the controller modules to inject a fault condition with respect to one of the operational entities to test the target computing environment.

In various embodiments, method 2500 further includes the computer system collecting metadata (e.g., pre-scan graph 2420) about a first state of the target computing environment before injection of the fault condition, collecting metadata (e.g., post-scan graph 2430) about a second state of the target computing environment after injection of the fault condition, and comparing the metadata about the first state and the metadata about the second state to determine an effect of the fault condition. In some embodiments, the computer system assembles a first graph data structure from the collected metadata about the first state, assembles a second graph data structure from the collected metadata about the second state, and compares the first graph data structure with the second graph data structure. In some embodiments, the computer system determines, based on the received indication, that the particular command is associated with an update process to update one or more of the components in the hierarchy and selects a fault condition to attempt to interfere with the update process. In one such embodiment, the selected fault condition includes terminating execution of an operational entity being updated during the update process. In another such embodiment, the selected fault condition includes increasing a latency for communications with an operational entity being updated during the update process.

In various embodiments, method 2500 further includes the computer system performing a discovery operation (e.g., via requests 2412 and responses 2414) to identify a set of injectable fault conditions supported by the controller modules and, based on the particular issued command, selecting one of the set of injectable fault conditions for injection by the instructed controller module. In some embodiments, the discovery operation includes the computer system contacting an orchestrator (e.g., orchestrator controller module 120) of the hierarchy to determine identities of one or more of the controller modules, the orchestrator being a controller module that issues commands to other controller modules. In some embodiments, the discovery operation includes the computer system sending, based on the determined identities and to the one or more controller modules, requests asking for the one or more controller modules to identify injectable fault conditions supported by the one or more controller modules. In some embodiments, the computer system maintains history information identifying previously injected fault conditions and, based on the history information, determines a respective difference score for each of the set of injectable fault conditions, each difference score being indicative of a difference of that fault condition relative to the previously injected fault conditions. In such an embodiment, the selecting of the fault condition for injection is further based on the determined difference scores.

Exemplary Computer System

Turning now to FIG. 26, a block diagram of an exemplary computer system 2600, which may implement a system 100, operational entity 110, controller module 120, database 130, and/or authorization service 140, is depicted. Computer system 2600 includes a processor subsystem 2680 that is coupled to a system memory 2620 and I/O interface(s) 2640 via an interconnect 2660 (e.g., a system bus). I/O interface(s) 2640 is coupled to one or more I/O devices 2650. Computer system 2600 may be any of various types of devices, including, but not limited to, a server system, personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, tablet computer, handheld computer, workstation, network computer, or a consumer device such as a mobile phone, music player, or personal data assistant (PDA). Although a single computer system 2600 is shown in FIG. 26 for convenience, system 2600 may also be implemented as two or more computer systems operating together.

Processor subsystem 2680 may include one or more processors or processing units. In various embodiments of computer system 2600, multiple instances of processor subsystem 2680 may be coupled to interconnect 2660. In various embodiments, processor subsystem 2680 (or each processor unit within subsystem 2680) may contain a cache or other form of on-board memory.

System memory 2620 is usable to store program instructions executable by processor subsystem 2680 to cause system 2600 to perform various operations described herein. System memory 2620 may be implemented using different physical memory media, such as hard disk storage, floppy disk storage, removable disk storage, flash memory, random access memory (RAM, such as SRAM, EDO RAM, SDRAM, DDR SDRAM, RAMBUS RAM, etc.), read only memory (PROM, EEPROM, etc.), and so on. Memory in computer system 2600 is not limited to primary storage such as memory 2620. Rather, computer system 2600 may also include other forms of storage such as cache memory in processor subsystem 2680 and secondary storage on I/O devices 2650 (e.g., a hard drive, storage array, etc.). In some embodiments, these other forms of storage may also store program instructions executable by processor subsystem 2680. In some embodiments, program instructions that when executed implement operational entity 110, controller module 120, database 130, authorization service 140, and/or test engine 150 may be included/stored within system memory 2620.

I/O interfaces 2640 may be any of various types of interfaces configured to couple to and communicate with other devices, according to various embodiments. In one embodiment, I/O interface 2640 is a bridge chip (e.g., Southbridge) from a front-side bus to one or more back-side buses. I/O interfaces 2640 may be coupled to one or more I/O devices 2650 via one or more corresponding buses or other interfaces. Examples of I/O devices 2650 include storage devices (hard drive, optical drive, removable flash drive, storage array, SAN, or their associated controller), network interface devices (e.g., to a local or wide-area network), or other devices (e.g., graphics, user interface devices, etc.). In one embodiment, computer system 2600 is coupled to a network via a network interface device 2650 (e.g., configured to communicate over WiFi, Bluetooth, Ethernet, etc.).

Although specific embodiments have been described above, these embodiments are not intended to limit the scope of the present disclosure, even where only a single embodiment is described with respect to a particular feature. Examples of features provided in the disclosure are intended to be illustrative rather than restrictive unless stated otherwise. The above description is intended to cover such alternatives, modifications, and equivalents as would be apparent to a person skilled in the art having the benefit of this disclosure.

The scope of the present disclosure includes any feature or combination of features disclosed herein (either explicitly or implicitly), or any generalization thereof, whether or not it mitigates any or all of the problems addressed herein. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the appended claims.

What is claimed is:
 1. A method, comprising: receiving, by a controller module executing on a computer system, an instruction that identifies a particular operational entity to be transitioned from a first state to a second state as part of automated implementation of an operational scenario; and causing, by the controller module, the instruction to be carried out for the particular operational entity by making a call to a routing layer, wherein the call does not specify whether the particular operational entity is remote relative to a local environment of the controller module, and wherein the routing layer is operable to make a determination on whether the particular operational entity is within the local environment or remote to the local environment, and wherein the routing layer is operable to use the determination to perform a routing operation in relation to the particular operational entity.
 2. The method of claim 1, wherein the routing layer is operable to access a blueprint for a routable entity associated with the particular operational entity, and wherein the routing layer is operable to determine that the particular operational entity is remote to the local environment based on whether the blueprint specifies a remote host port.
 3. The method of claim 2, wherein the routing layer is operable to access a blueprint for the particular operational entity that specifies relationship information for a relationship that is between the particular operational entity and the routable entity, and wherein the relationship enables the routing layer to access the blueprint for the routable entity.
 4. The method of claim 1, wherein the controller module is operable to make the same call to the routing layer independent of whether the particular operational entity is within the local environment or remote to the local environment.
 5. The method of claim 1, wherein the call specifies a particular function implemented by the particular operational entity for carrying out the instruction, and wherein the routing layer is operable to perform the routing operation by invoking the particular function.
 6. The method of claim 1, wherein the routing layer is operable to perform the routing operation by routing the instruction to another controller module that manages the particular operational entity.
 7. The method of claim 6, wherein the routing layer is operable to select a first routing protocol for routing the instruction to the other controller module based on the determination indicating that the particular operational entity is remote to the local environment, wherein the first routing protocol is different than a second routing protocol usable to route instructions within the local environment.
 8. The method of claim 1, wherein the controller module is included within a hierarchy of components having controller modules and operational entities, and wherein the hierarchy includes an orchestrator controller module at a top level of the hierarchy that is executable to implement the operational scenario by issuing instructions to controller modules at a next level of the hierarchy.
 9. The method of claim 8, wherein the instruction is received by the controller module from the orchestrator controller module as part of implementing the operational scenario that includes starting up a database service having a set of database servers capable of performing database transactions on behalf of users of the computer system.
 10. The method of claim 9, wherein the call is made to the routing layer to cause the routing layer to invoke a function of the particular operational entity to instantiate a database server as part of starting up the database service.
 11. A non-transitory computer readable medium having program instructions stored thereon that are capable of causing a computer system to implement a routing layer capable of performing operations comprising: receiving a request to route an instruction to a particular operational entity that is to be transitioned from a first state to a second state, wherein the request does not specify whether the particular operational entity is remote relative to a local environment of a controller module from which the request is received; making, based on information maintained for the particular operational entity, a determination on whether the particular operational entity is within the local environment or remote to the local environment; and routing the instruction to the particular operational entity based on the determination.
 12. The medium of claim 11, wherein the information defines a blueprint for the particular operational entity, wherein the blueprint defines a relationship between the particular operational entity and a routable entity that is associated with a second blueprint that indicates whether the particular operational entity is within the local environment or remote to the local environment.
 13. The medium of claim 12, wherein the operations further comprise: accessing, based on the relationship, the second blueprint, wherein making the determination includes determining that the particular operational entity is remote to the local environment based on the accessed second blueprint specifying a remote host port.
 14. The medium of claim 11, wherein routing the instruction includes: invoking a particular function that is implemented by the particular operational entity for transitioning the particular operational entity from the first state to the second state.
 15. The medium of claim 11, wherein routing the instruction includes: sending the instruction to another controller module within a next level of a hierarchy of controllers relative to the controller module from which the request is received, wherein the other controller module directly manages the particular operational entity.
 16. A method, comprising: executing, by a computer system, a hierarchy of components that include controller modules and operational entities, wherein the hierarchy includes an orchestrator controller module at a top level of the hierarchy that is executable to perform an operational scenario by issuing a set of instructions to controller modules at a next level of the hierarchy; receiving, by a controller module of the hierarchy that is executing on the computer system, an instruction that identifies a particular one of the operational entities that is to be transitioned from a first state to a second state; and causing, by the controller module, the instruction to be carried out by making a call to a routing layer, wherein the call does not specify whether the particular operational entity is remote relative to a local environment of the controller module, and wherein the routing layer is operable to make a determination on whether the particular operational entity is within the local environment or remote to the local environment, and wherein the routing layer is operable to use the determination to route the instruction to the particular operational entity.
 17. The method of claim 16, wherein the routing layer is operable to determine that the particular operational entity is remote to the local environment in response to the particular operational entity being associated with a remote host port.
 18. The method of claim 16, wherein the routing layer is operable to utilize a first routing protocol for routing instructions to operational entities that are remote to the local environment and a second, different routing protocol for routing instructions to operational entities that are within the local environment.
 19. The method of claim 16, wherein routing the instruction includes routing the instruction to another controller module within a next level of the hierarchy, wherein the other controller module directly manages the particular operational entity.
 20. The method of claim 16, wherein the controller module is operable to make the same call to the routing layer independent of whether the particular operational entity is remote relative to the local environment of the controller module.
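
By way of non-limiting illustration only, the location-transparent call recited in the claims above might be sketched as follows. The sketch is not the disclosed implementation. The names Blueprint, RoutingLayer, routable_id, remote_host_port, and sent_remote are hypothetical, as are the entity identifiers and the address, and the remote path merely records what would be forwarded rather than using any real routing protocol. The controller makes the same route() call in both cases, and the routing layer decides locality by following the entity's relationship to a routable entity and checking whether that blueprint specifies a remote host port.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Blueprint:
    """Hypothetical blueprint record; field names are illustrative only."""
    entity_id: str
    routable_id: Optional[str] = None       # relationship to a routable entity
    remote_host_port: Optional[str] = None  # present only for remote routable entities


@dataclass
class RoutingLayer:
    """Toy routing layer: the caller never states whether the target is remote."""
    blueprints: Dict[str, Blueprint]
    # Functions implemented by operational entities in the local environment.
    local_functions: Dict[str, Dict[str, Callable[[], str]]] = field(default_factory=dict)
    # Stand-in for a remote transport: records (host_port, entity_id, function).
    sent_remote: List[Tuple[str, str, str]] = field(default_factory=list)

    def _routable(self, entity_id: str) -> Blueprint:
        # Follow the relationship, if any, to the routable entity's blueprint.
        bp = self.blueprints[entity_id]
        return self.blueprints[bp.routable_id] if bp.routable_id else bp

    def is_remote(self, entity_id: str) -> bool:
        # Remote if and only if the routable blueprint specifies a host port.
        return self._routable(entity_id).remote_host_port is not None

    def route(self, entity_id: str, function_name: str) -> str:
        if self.is_remote(entity_id):
            host_port = self._routable(entity_id).remote_host_port
            # Assumption: a real system would forward to a remote controller here.
            self.sent_remote.append((host_port, entity_id, function_name))
            return "forwarded"
        # Local case: invoke the function implemented by the entity directly.
        return self.local_functions[entity_id][function_name]()


# Example data (all identifiers and the address are made up for illustration).
layer = RoutingLayer(
    blueprints={
        "db-local": Blueprint("db-local", routable_id="host-a"),
        "db-remote": Blueprint("db-remote", routable_id="host-b"),
        "host-a": Blueprint("host-a"),
        "host-b": Blueprint("host-b", remote_host_port="10.0.0.5:8443"),
    },
    local_functions={"db-local": {"start": lambda: "started"}},
)

if __name__ == "__main__":
    # The controller issues the identical call for both entities.
    print(layer.route("db-local", "start"))
    print(layer.route("db-remote", "start"))
```

Keeping the locality decision inside the routing layer means a controller module in this sketch needs no knowledge of where an operational entity runs; only the blueprint data changes when an entity moves.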